GFA
- class gratools.GFA(gfa_path, threads=1, logger=<factory>, gfa_name=None, version=None, header_gfa=<factory>, sample_reference=None, bam_segments_file=None, dict_samples_chrom=<factory>, dict_segments_size=<factory>, dict_segments_samples=<factory>, dict_samples_bed=<factory>, works_path=None, bed_path=None, bam_path=None, found_minigraph=False, index_links=False, db_links=None, segment_count=0, total_segment_length=0, link_count=0, degrees=<factory>, walks_count=0, max_walk_rank=0, sum_rank0_length=0, input_genome_size=0, walks_info=<factory>, inverted_links_count=0, negative_links_count=0, self_links_count=0, isolated_segments=<factory>, shared_executor=None, progress=None, line_type_counts=<factory>, header_gfa_file=None, stats_file=None, db_file_path=None)
Bases: object
Manages parsing of a GFA (Graphical Fragment Assembly) file, computes statistics, and generates related files such as a BAM file containing the segments and per-sample BED files describing paths. Links are stored through an asynchronous database and BED records are written through an asynchronous BED writer.
Attributes
- gfa_path (Path)
Path to the input GFA file (can be .gfa or .gfa.gz).
- threads (int, optional)
Number of threads for operations like BAM file processing. Default is 1.
- logger (logging.Logger)
Logger object. Default is a logger named “GraTools”.
- gfa_name (Optional[str])
Name of the GFA file derived from gfa_path (without extensions). Auto-initialized.
- version (Optional[str])
GFA version extracted from the header (e.g., “1.0”). Auto-initialized.
- header_gfa (List[str])
List of header lines (H lines) from the GFA file. Auto-initialized.
- sample_reference (Optional[str])
Reference sample name, potentially from GFA header (RS tag). Auto-initialized.
- bam_segments_file (Optional[Path])
Path to the BAM file where segments (S lines) will be written. Auto-initialized.
- dict_samples_chrom (defaultdict[str, OrderedDict[str, List[str]]])
Maps sample names to an OrderedDict of chromosome names, each of which maps to a list of “start\tstop” fragment strings derived from Walk (W) lines. Auto-initialized.
- dict_segments_size (defaultdict[str, int])
Maps segment IDs to their length (in base pairs). Auto-initialized.
- dict_segments_samples (defaultdict[str, List[str]])
Maps segment IDs to a list of sample identifiers (“sample;chromosome;haplotype”) that contain the segment. Auto-initialized.
- dict_samples_bed (defaultdict[str, OrderedDict[str, Path]])
Not directly populated by the current parsing logic. It was likely intended to track the paths of generated BED files, which the AsyncBedWriter manages internally, so this attribute may be redundant or used only for post-processing tracking. Auto-initialized.
- works_path (Optional[Path])
Path to the working directory (e.g., “…/{gfa_name}_GraTools_INDEX”). Auto-initialized.
- bed_path (Optional[Path])
Path to the subdirectory for BED files within works_path. Auto-initialized.
- bam_path (Optional[Path])
Path to the subdirectory for BAM files within works_path. Auto-initialized.
- found_minigraph (bool)
Flag indicating if a sample named ‘MINIGRAPH’ (case-insensitive) was found in Walk lines. Defaults to False.
- index_links (bool, optional)
If True, GFA links (L lines) are stored in an SQLite database. Defaults to False, as shown in the class signature.
- db_links (Optional[AsyncGfaDatabase])
Asynchronous database handler for GFA links. Auto-initialized if index_links is True.
- segment_count (int)
Total number of segments (S lines) processed. Defaults to 0.
- total_segment_length (int)
Sum of lengths of all segments. Defaults to 0.
- link_count (int)
Total number of links (L lines) processed. Defaults to 0.
- degrees (defaultdict[str, int])
Maps segment IDs to their degree (number of links connected). Defaults to an empty defaultdict.
- walks_count (int)
Total number of walks (W lines) processed. Defaults to 0.
- max_walk_rank (int)
Maximum number of segments in any single walk. Defaults to 0.
- sum_rank0_length (int)
Sum of lengths of the first segments of all walks. Defaults to 0.
- input_genome_size (int)
Cumulative size of all paths (sum of segment lengths along each walk). Defaults to 0.
- walks_info (List[Dict[str, Any]])
List of dictionaries, each containing info for a walk (Path name, Sequence length, Num Segments). Defaults to an empty list.
- inverted_links_count (int)
Count of links where orientations differ (e.g., S1+ -> S2-). Defaults to 0.
- negative_links_count (int)
Count of links where both segments have negative orientation (S1- -> S2-). Defaults to 0.
- self_links_count (int)
Count of links where a segment links to itself (S1 -> S1). Defaults to 0.
- isolated_segments (Set[str])
Set of segment IDs that have no links connected to them. Initialized with all segments, then linked ones removed. Defaults to an empty set.
- shared_executor (Optional[ThreadPoolExecutor])
Executor for running synchronous tasks in threads. Auto-initialized.
- progress (Optional[Progress])
Rich Progress instance for displaying progress. Auto-initialized.
- line_type_counts (Counter)
Counts of each GFA line type (H, S, L, W, P, C, E, U). Auto-initialized.
- header_gfa_file (Optional[Path])
Path to where the GFA header is saved. Auto-initialized.
- stats_file (Optional[Path])
Path to where GFA statistics are saved. Auto-initialized.
- db_file_path (Optional[Path])
Path to the SQLite database file for links. Auto-initialized.
- RE_ORIENTED_SEG_GT_LT (re.Pattern)
Compiled regular expression for parsing oriented segments using ‘>’ and ‘<’. Auto-initialized.
- RE_ORIENTED_SEG_PLUS_MINUS (re.Pattern)
Compiled regular expression for parsing oriented segments using ‘+’ and ‘-’. Auto-initialized.
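The two compiled patterns listed last handle the two orientation notations found in GFA files. The actual expressions used by GraTools are not shown in this reference, so the following is only a minimal sketch with hypothetical patterns (RE_GT_LT, RE_PLUS_MINUS) and hypothetical helper names (parse_walk, parse_path), illustrating how such patterns could split oriented segments:

```python
import re

# Hypothetical patterns for illustration; not the expressions used by GraTools.
RE_GT_LT = re.compile(r"([><])([^><\s]+)")               # e.g. ">s1<s2"
RE_PLUS_MINUS = re.compile(r"([^,\s]+?)([+-])(?:,|$)")   # e.g. "s1+,s2-"


def parse_walk(walk: str) -> list[tuple[str, str]]:
    """Split a '>'/'<' oriented walk into (segment_id, orientation) pairs."""
    return [(seg, "+" if sign == ">" else "-") for sign, seg in RE_GT_LT.findall(walk)]


def parse_path(path: str) -> list[tuple[str, str]]:
    """Split a '+'/'-' oriented, comma-separated path into (segment_id, orientation) pairs."""
    return [(seg, sign) for seg, sign in RE_PLUS_MINUS.findall(path)]
```

Both helpers normalize orientation to '+' or '-', so downstream code can treat the two notations identically.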
Attributes Summary
Methods Summary
- compute_statistics(): Computes various statistics about the parsed GFA graph, including segment counts, lengths, link properties, and connectivity (if link indexing is enabled).
- parse_gfa(): Parses the GFA file line by line, processing headers, segments, links, and walks.
- run(): Synchronous entry point to orchestrate GFA parsing and BED file sorting.
- Save the GFA header lines (H lines) to a text file.
- Save sample-chromosome-fragment information to 'samples_chrom.txt'.
- tag_bam(): Tags the generated BAM segments file with sample walk information (SW tag).
Attributes Documentation
- db_links: AsyncGfaDatabase | None = None
Methods Documentation
- compute_statistics()
Computes various statistics about the parsed GFA graph, including segment counts, lengths, link properties, and connectivity (if link indexing is enabled). Saves these statistics to a file.
Returns
- Dict[str, Any]
A dictionary containing all computed statistics, categorized.
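The link-related counters described in the attribute list (degrees, inverted_links_count, negative_links_count, self_links_count, isolated_segments) can be illustrated with a small standalone function. This is a sketch under stated assumptions, not the GraTools implementation: the function name link_statistics, the tuple layout of a link, and the choice to count a self link twice in a segment's degree are all hypothetical.

```python
from collections import defaultdict


def link_statistics(segments, links):
    """Summarize links given as (from_id, from_orient, to_id, to_orient) tuples."""
    degrees = defaultdict(int)
    isolated = set(segments)  # start with every segment, then remove linked ones
    inverted = negative = self_links = 0

    for src, src_orient, dst, dst_orient in links:
        # Assumption: each link end adds one to the degree, so a self link adds two.
        degrees[src] += 1
        degrees[dst] += 1
        isolated.discard(src)
        isolated.discard(dst)
        if src_orient != dst_orient:
            inverted += 1      # orientations differ, e.g. S1+ -> S2-
        elif src_orient == "-":
            negative += 1      # both ends negative, e.g. S1- -> S2-
        if src == dst:
            self_links += 1    # segment linked to itself

    return {
        "link_count": len(links),
        "degrees": dict(degrees),
        "inverted_links_count": inverted,
        "negative_links_count": negative,
        "self_links_count": self_links,
        "isolated_segments": isolated,
    }
```

Starting from the full segment set and discarding linked segments mirrors the description of isolated_segments above.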
- async parse_gfa()
Parses the GFA file line by line, processing headers, segments, links, and walks. Segments are written to a BAM file. Links are (optionally) stored in an async SQLite DB. Walk information is used to generate data for BED files, written by AsyncBedWriter.
- Return type:
None
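To make the line-by-line dispatch concrete, here is a simplified, synchronous sketch of a single pass over GFA lines. It is not the GraTools method, which is asynchronous and writes to BAM, SQLite, and BED outputs. The function name scan_gfa_lines, the returned dictionary layout, and the "start\tstop" fragment format are assumptions made for illustration; the S and W field positions follow the GFA 1.1 line layout.

```python
from collections import Counter, OrderedDict, defaultdict


def scan_gfa_lines(lines):
    """Single pass over GFA lines collecting counts, headers, sizes, and walk fragments."""
    line_type_counts = Counter()
    header = []
    segment_sizes = defaultdict(int)
    samples_chrom = defaultdict(OrderedDict)

    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        record_type = fields[0]
        line_type_counts[record_type] += 1

        if record_type == "H":
            header.append(line)
        elif record_type == "S":
            # S <id> <sequence> [tags]; '*' means the sequence is absent,
            # in which case the length is taken from an LN:i: tag if present.
            seg_id, sequence = fields[1], fields[2]
            length = len(sequence)
            if sequence == "*":
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[5:])
            segment_sizes[seg_id] = length
        elif record_type == "W":
            # W <sample> <haplotype> <seq_id> <start> <end> <walk>
            sample, chrom, start, stop = fields[1], fields[3], fields[4], fields[5]
            samples_chrom[sample].setdefault(chrom, []).append(f"{start}\t{stop}")

    return {
        "line_type_counts": line_type_counts,
        "header": header,
        "segment_sizes": dict(segment_sizes),
        "samples_chrom": samples_chrom,
    }
```

Link (L) lines are only counted here; in the class they are optionally stored in the link database.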
- run()
Synchronous entry point to orchestrate GFA parsing and BED file sorting. Sets up an asyncio event loop and runs the asynchronous parse_gfa method. Then, sorts the generated BED files using a ThreadPoolExecutor.
- Parameters:
gfa_path (Path)
threads (int)
logger (Logger)
gfa_name (str | None)
version (str | None)
sample_reference (str | None)
bam_segments_file (Path | None)
dict_samples_chrom (defaultdict)
dict_segments_size (defaultdict)
dict_segments_samples (defaultdict)
dict_samples_bed (defaultdict)
works_path (Path | None)
bed_path (Path | None)
bam_path (Path | None)
found_minigraph (bool)
index_links (bool)
db_links (AsyncGfaDatabase | None)
segment_count (int)
total_segment_length (int)
link_count (int)
degrees (defaultdict)
walks_count (int)
max_walk_rank (int)
sum_rank0_length (int)
input_genome_size (int)
inverted_links_count (int)
negative_links_count (int)
self_links_count (int)
isolated_segments (set)
shared_executor (ThreadPoolExecutor | None)
progress (Progress | None)
line_type_counts (Counter)
header_gfa_file (Path | None)
stats_file (Path | None)
db_file_path (Path | None)
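The pattern described for run(), an asynchronous parsing step driven from synchronous code and followed by threaded sorting of BED files, can be sketched as follows. Everything here is a stand-in written for illustration: parse_stub, sort_bed, run_pipeline, and demo are hypothetical names, and the BED content is made up. Only asyncio.run and ThreadPoolExecutor are taken from the description above.

```python
import asyncio
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


async def parse_stub(bed_dir: Path) -> list[Path]:
    """Stand-in for the asynchronous parsing step: writes two unsorted BED files."""
    contents = {
        "a.bed": "chr1\t20\t30\ts2\nchr1\t0\t20\ts1\n",
        "b.bed": "chr2\t5\t9\ts4\nchr1\t7\t8\ts3\n",
    }
    paths = []
    for name, text in contents.items():
        path = bed_dir / name
        path.write_text(text)
        paths.append(path)
        await asyncio.sleep(0)  # yield control, as real async I/O would
    return paths


def sort_bed(path: Path) -> Path:
    """Sort one BED file in place by chromosome, then start, then end."""
    rows = [line.split("\t") for line in path.read_text().splitlines() if line]
    rows.sort(key=lambda row: (row[0], int(row[1]), int(row[2])))
    path.write_text("".join("\t".join(row) + "\n" for row in rows))
    return path


def run_pipeline(bed_dir: Path, threads: int = 1) -> list[Path]:
    """Synchronous entry point: run the async step, then sort files in threads."""
    bed_files = asyncio.run(parse_stub(bed_dir))
    with ThreadPoolExecutor(max_workers=threads) as executor:
        return list(executor.map(sort_bed, bed_files))


def demo() -> dict[str, str]:
    """Run the pipeline in a temporary directory and return the sorted file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        return {p.name: p.read_text() for p in run_pipeline(Path(tmp), threads=2)}
```

Keeping the sort step synchronous and dispatching it through an executor lets each file be processed independently once parsing has finished.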