GFA

class gratools.GFA(gfa_path, threads=1, logger=<factory>, gfa_name=None, version=None, header_gfa=<factory>, sample_reference=None, bam_segments_file=None, dict_samples_chrom=<factory>, dict_segments_size=<factory>, dict_segments_samples=<factory>, dict_samples_bed=<factory>, works_path=None, bed_path=None, bam_path=None, found_minigraph=False, index_links=False, db_links=None, segment_count=0, total_segment_length=0, link_count=0, degrees=<factory>, walks_count=0, max_walk_rank=0, sum_rank0_length=0, input_genome_size=0, walks_info=<factory>, inverted_links_count=0, negative_links_count=0, self_links_count=0, isolated_segments=<factory>, shared_executor=None, progress=None, line_type_counts=<factory>, header_gfa_file=None, stats_file=None, db_file_path=None)[source]

Bases: object

Manages parsing of a GFA (Graphical Fragment Assembly) file, computes statistics, and generates related files, such as a BAM file containing the segments and one BED file per sample path. It also populates an asynchronous database with the links and feeds an asynchronous BED writer.

Attributes

gfa_path (Path): Path to the input GFA file (can be .gfa or .gfa.gz).
threads (int, optional): Number of threads for operations like BAM file processing. Default is 1.
logger (logging.Logger): Logger object. Default is a logger named “GraTools”.
gfa_name (Optional[str]): Name of the GFA file derived from gfa_path (without extensions). Auto-initialized.
version (Optional[str]): GFA version extracted from the header (e.g., “1.0”). Auto-initialized.
header_gfa (List[str]): List of header lines (H lines) from the GFA file. Auto-initialized.
sample_reference (Optional[str]): Reference sample name, potentially from GFA header (RS tag). Auto-initialized.
bam_segments_file (Optional[Path]): Path to the BAM file where segments (S lines) will be written. Auto-initialized.
dict_samples_chrom (defaultdict[str, OrderedDict[str, List[str]]]): Maps sample names to an OrderedDict of chromosome names, each of which maps to a list of “start\tstop” (tab-separated) fragment strings derived from Walk (W) lines. Auto-initialized.
dict_segments_size (defaultdict[str, int]): Maps segment IDs to their length (in base pairs). Auto-initialized.
dict_segments_samples (defaultdict[str, List[str]]): Maps segment IDs to a list of sample identifiers (“sample;chromosome;haplotype”) that contain the segment. Auto-initialized.
dict_samples_bed (defaultdict[str, OrderedDict[str, Path]]): Not directly populated by the current parsing logic. It was likely intended to track the paths of generated BED files, which AsyncBedWriter manages internally, so this attribute may be redundant or used only for post-processing tracking. Auto-initialized.
works_path (Optional[Path]): Path to the working directory (e.g., “…/{gfa_name}_GraTools_INDEX”). Auto-initialized.
bed_path (Optional[Path]): Path to the subdirectory for BED files within works_path. Auto-initialized.
bam_path (Optional[Path]): Path to the subdirectory for BAM files within works_path. Auto-initialized.
found_minigraph (bool): Flag indicating if a sample named ‘MINIGRAPH’ (case-insensitive) was found in Walk lines. Defaults to False.
index_links (bool, optional): If True, GFA links (L lines) are stored in an SQLite database. Defaults to False.
db_links (Optional[AsyncGfaDatabase]): Asynchronous database handler for GFA links. Auto-initialized if index_links is True.
segment_count (int): Total number of segments (S lines) processed. Defaults to 0.
total_segment_length (int): Sum of lengths of all segments. Defaults to 0.
link_count (int): Total number of links (L lines) processed. Defaults to 0.
degrees (defaultdict[str, int]): Maps segment IDs to their degree (number of links connected). Defaults to an empty defaultdict.
walks_count (int): Total number of walks (W lines) processed. Defaults to 0.
max_walk_rank (int): Maximum number of segments in any single walk. Defaults to 0.
sum_rank0_length (int): Sum of lengths of the first segments of all walks. Defaults to 0.
input_genome_size (int): Cumulative size of all paths (sum of segment lengths along each walk). Defaults to 0.
walks_info (List[Dict[str, Any]]): List of dictionaries, each containing info for a walk (Path name, Sequence length, Num Segments). Defaults to an empty list.
inverted_links_count (int): Count of links where orientations differ (e.g., S1+ -> S2-). Defaults to 0.
negative_links_count (int): Count of links where both segments have negative orientation (S1- -> S2-). Defaults to 0.
self_links_count (int): Count of links where a segment links to itself (S1 -> S1). Defaults to 0.
isolated_segments (Set[str]): Set of segment IDs that have no links connected to them. Initialized with all segments, then linked ones removed. Defaults to an empty set.
shared_executor (Optional[ThreadPoolExecutor]): Executor for running synchronous tasks in threads. Auto-initialized.
progress (Optional[Progress]): Rich Progress instance for displaying progress. Auto-initialized.
line_type_counts (Counter): Counts of each GFA line type (H, S, L, W, P, C, E, U). Auto-initialized.
header_gfa_file (Optional[Path]): Path to where the GFA header is saved. Auto-initialized.
stats_file (Optional[Path]): Path to where GFA statistics are saved. Auto-initialized.
db_file_path (Optional[Path]): Path to the SQLite database file for links. Auto-initialized.
RE_ORIENTED_SEG_GT_LT (re.Pattern): Compiled regular expression for parsing oriented segments using ‘>’ and ‘<’. Auto-initialized.
RE_ORIENTED_SEG_PLUS_MINUS (re.Pattern): Compiled regular expression for parsing oriented segments using ‘+’ and ‘-’. Auto-initialized.
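
The two regular-expression attributes distinguish the ‘>’/‘<’ and ‘+’/‘-’ notations for oriented segments. The following is a minimal sketch of how such patterns could work. The pattern bodies and the helper names `parse_gt_lt` and `parse_plus_minus` are assumptions made for illustration; the expressions actually compiled by GraTools are not shown in this reference and may differ.

```python
import re
from typing import List, Tuple

# Hypothetical patterns, NOT the ones compiled by GraTools.
RE_GT_LT = re.compile(r"([><])([^><]+)")    # e.g. ">s1<s2"
RE_PLUS_MINUS = re.compile(r"([^,]+)([+-])")  # e.g. "s1+,s2-"


def parse_gt_lt(walk: str) -> List[Tuple[str, str]]:
    """Return (segment_id, orientation) pairs, mapping '>' to '+' and '<' to '-'."""
    return [(seg, "+" if sign == ">" else "-") for sign, seg in RE_GT_LT.findall(walk)]


def parse_plus_minus(path: str) -> List[Tuple[str, str]]:
    """Return (segment_id, orientation) pairs from a comma-separated list."""
    return [(seg, sign) for seg, sign in RE_PLUS_MINUS.findall(path)]


print(parse_gt_lt(">s1<s2>s3"))
print(parse_plus_minus("s1+,s2-,s3+"))
```

Both helpers normalize to the same (segment, ‘+’/‘-’) representation, so downstream code in this sketch does not need to know which notation the input used.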

Attributes Summary

`bam_path`
`bam_segments_file`
`bed_path`
`db_file_path`
`db_links`
`found_minigraph`
`gfa_name`
`header_gfa_file`
`index_links`
`input_genome_size`
`inverted_links_count`
`link_count`
`max_walk_rank`
`negative_links_count`
`progress`
`sample_reference`
`segment_count`
`self_links_count`
`shared_executor`
`stats_file`
`sum_rank0_length`
`threads`
`total_segment_length`
`version`
`walks_count`
`works_path`

Methods Summary

`compute_statistics`()	Computes various statistics about the parsed GFA graph, including segment counts, lengths, link properties, and connectivity (if link indexing is enabled).
`parse_gfa`()	Parses the GFA file line by line, processing headers, segments, links, and walks.
`run`()	Synchronous entry point to orchestrate GFA parsing and BED file sorting.
`save_header`()	Save the GFA header lines (H lines) to a text file.
`save_samples_chrom`()	Save sample-chromosome-fragment information to 'samples_chrom.txt'.
`tag_bam`()	Tags the generated BAM segments file with sample walk information (SW tag).

Attributes Documentation

bam_path: Path | None = None

bam_segments_file: Path | None = None

bed_path: Path | None = None

db_file_path: Path | None = None

db_links: AsyncGfaDatabase | None = None

found_minigraph: bool = False

gfa_name: str | None = None

header_gfa_file: Path | None = None

index_links: bool = False

input_genome_size: int = 0

inverted_links_count: int = 0

link_count: int = 0

max_walk_rank: int = 0

negative_links_count: int = 0

progress: Progress | None = None

sample_reference: str | None = None

segment_count: int = 0

self_links_count: int = 0

shared_executor: ThreadPoolExecutor | None = None

stats_file: Path | None = None

sum_rank0_length: int = 0

threads: int = 1

total_segment_length: int = 0

version: str | None = None

walks_count: int = 0

works_path: Path | None = None

Methods Documentation

compute_statistics()[source]

Computes various statistics about the parsed GFA graph, including segment counts, lengths, link properties, and connectivity (if link indexing is enabled). Saves these statistics to a file.

Returns

Dict[str, Any]: A dictionary containing all computed statistics, categorized.

Return type: Dict[str, Any]
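
The link-related counters described in the attribute list (degrees, inverted, negative and self links, isolated segments) can be illustrated with a small standalone sketch. The function `classify_links` and its tuple-based input are hypothetical and are not part of the GraTools API; counting a self link twice in the degree is also an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[str, str, str, str]  # (from_segment, from_orient, to_segment, to_orient)


def classify_links(
    segments: Iterable[str], links: Iterable[Link]
) -> Tuple[Dict[str, int], Dict[str, int], Set[str]]:
    """Return (counts, degrees, isolated_segments) for a list of links."""
    counts = {"inverted": 0, "negative": 0, "self": 0}
    degrees: Dict[str, int] = defaultdict(int)
    isolated = set(segments)  # start with every segment, remove linked ones
    for from_seg, from_ori, to_seg, to_ori in links:
        degrees[from_seg] += 1
        degrees[to_seg] += 1  # assumption: a self link adds 2 to the degree
        isolated.discard(from_seg)
        isolated.discard(to_seg)
        if from_ori != to_ori:
            counts["inverted"] += 1  # e.g. S1+ -> S2-
        elif from_ori == "-":
            counts["negative"] += 1  # S1- -> S2-
        if from_seg == to_seg:
            counts["self"] += 1  # S1 -> S1
    return counts, dict(degrees), isolated


counts, degrees, isolated = classify_links(
    ["s1", "s2", "s3", "s4"],
    [("s1", "+", "s2", "+"), ("s1", "+", "s2", "-"),
     ("s2", "-", "s3", "-"), ("s3", "+", "s3", "+")],
)
print(counts, degrees, isolated)
```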

async parse_gfa()[source]

Parses the GFA file line by line, processing headers, segments, links, and walks. Segments are written to a BAM file. Links are (optionally) stored in an async SQLite DB. Walk information is used to generate data for BED files, written by AsyncBedWriter.

Return type: None
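
As a rough illustration of line-by-line processing, the sketch below reads a plain or gzip-compressed GFA file and tallies line types, similar in spirit to the line_type_counts attribute. It is synchronous and deliberately simplified; `count_line_types` is a hypothetical helper, not the GraTools implementation, and it ignores BAM, database and BED output entirely.

```python
import gzip
from collections import Counter
from pathlib import Path


def count_line_types(gfa_path: Path) -> Counter:
    """Count GFA records by their first character (H, S, L, W, P, ...)."""
    # Pick the opener from the file extension (.gfa or .gfa.gz).
    opener = gzip.open if gfa_path.suffix == ".gz" else open
    counts: Counter = Counter()
    with opener(gfa_path, "rt") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                counts[line[0]] += 1
    return counts
```

A call such as `count_line_types(Path("graph.gfa.gz"))` (the file name here is only an example) would return a Counter keyed by record type.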

run()[source]: Synchronous entry point to orchestrate GFA parsing and BED file sorting. Sets up an asyncio event loop and runs the asynchronous parse_gfa method. Then, sorts the generated BED files using a ThreadPoolExecutor.

save_header()[source]

Save the GFA header lines (H lines) to a text file.

Return type: None

save_samples_chrom()[source]

Save sample-chromosome-fragment information to ‘samples_chrom.txt’. This data is derived from GFA Walk (W) lines.

Return type: None

tag_bam()[source]

Tags the generated BAM segments file with sample walk information (SW tag). This uses the GratoolsBam class to perform the tagging. The original BAM segments file is overwritten with the tagged version.

Return type: None

Parameters:

gfa_path (Path)
threads (int)
logger (Logger)
gfa_name (str | None)
version (str | None)
header_gfa (List[str])
sample_reference (str | None)
bam_segments_file (Path | None)
dict_samples_chrom (defaultdict)
dict_segments_size (defaultdict)
dict_segments_samples (defaultdict)
dict_samples_bed (defaultdict)
works_path (Path | None)
bed_path (Path | None)
bam_path (Path | None)
found_minigraph (bool)
index_links (bool)
db_links (AsyncGfaDatabase | None)
segment_count (int)
total_segment_length (int)
link_count (int)
degrees (defaultdict)
walks_count (int)
max_walk_rank (int)
sum_rank0_length (int)
input_genome_size (int)
walks_info (List[Dict[str, Any]])
inverted_links_count (int)
negative_links_count (int)
self_links_count (int)
isolated_segments (set)
shared_executor (ThreadPoolExecutor | None)
progress (Progress | None)
line_type_counts (Counter)
header_gfa_file (Path | None)
stats_file (Path | None)
db_file_path (Path | None)