GFA

class gratools.GFA(gfa_path, threads=1, logger=<factory>, gfa_name=None, version=None, header_gfa=<factory>, sample_reference=None, bam_segments_file=None, dict_samples_chrom=<factory>, dict_segments_size=<factory>, dict_segments_samples=<factory>, dict_samples_bed=<factory>, works_path=None, bed_path=None, bam_path=None, found_minigraph=False, index_links=False, db_links=None, segment_count=0, total_segment_length=0, link_count=0, degrees=<factory>, walks_count=0, max_walk_rank=0, sum_rank0_length=0, input_genome_size=0, walks_info=<factory>, inverted_links_count=0, negative_links_count=0, self_links_count=0, isolated_segments=<factory>, shared_executor=None, progress=None, line_type_counts=<factory>, header_gfa_file=None, stats_file=None, db_file_path=None)[source]

Bases: object

Manages parsing of a GFA (Graphical Fragment Assembly) file, computes statistics, and generates related files, such as a BAM file containing the segments and per-sample BED files describing paths. During parsing, links are written to an asynchronous database and BED records are passed to an asynchronous BED writer.

Attributes

gfa_path : Path

Path to the input GFA file (can be .gfa or .gfa.gz).

threads : int, optional

Number of threads for operations like BAM file processing. Default is 1.

logger : logging.Logger

Logger object. Default is a logger named “GraTools”.

gfa_name : Optional[str]

Name of the GFA file derived from gfa_path (without extensions). Auto-initialized.

version : Optional[str]

GFA version extracted from the header (e.g., “1.0”). Auto-initialized.

header_gfa : List[str]

List of header lines (H lines) from the GFA file. Auto-initialized.

sample_reference : Optional[str]

Reference sample name, potentially from GFA header (RS tag). Auto-initialized.

bam_segments_file : Optional[Path]

Path to the BAM file where segments (S lines) will be written. Auto-initialized.

dict_samples_chrom : defaultdict[str, OrderedDict[str, List[str]]]

Maps sample names to an OrderedDict of chromosome names, each of which maps to a list of “start\tstop” (tab-separated) fragment strings derived from Walk (W) lines. Auto-initialized.

dict_segments_size : defaultdict[str, int]

Maps segment IDs to their lengths (in base pairs). Auto-initialized.

dict_segments_samples : defaultdict[str, List[str]]

Maps segment IDs to a list of sample identifiers (“sample;chromosome;haplotype”) that contain the segment. Auto-initialized.

dict_samples_bed : defaultdict[str, OrderedDict[str, Path]]

Note: this attribute is not directly populated by the current parsing logic. It was likely intended to track the paths of generated BED files, which the AsyncBedWriter manages internally, so it may be redundant or reserved for post-processing tracking. Auto-initialized.

works_path : Optional[Path]

Path to the working directory (e.g., “…/{gfa_name}_GraTools_INDEX”). Auto-initialized.

bed_path : Optional[Path]

Path to the subdirectory for BED files within works_path. Auto-initialized.

bam_path : Optional[Path]

Path to the subdirectory for BAM files within works_path. Auto-initialized.

found_minigraph : bool

Flag indicating whether a sample named ‘MINIGRAPH’ (case-insensitive) was found in Walk lines. Defaults to False.

index_links : bool, optional

If True, GFA links (L lines) are stored in an SQLite database. Defaults to False.

db_links : Optional[AsyncGfaDatabase]

Asynchronous database handler for GFA links. Auto-initialized if index_links is True.

segment_count : int

Total number of segments (S lines) processed. Defaults to 0.

total_segment_length : int

Sum of lengths of all segments. Defaults to 0.

link_count : int

Total number of links (L lines) processed. Defaults to 0.

degrees : defaultdict[str, int]

Maps segment IDs to their degree (number of links connected). Defaults to an empty defaultdict.

walks_count : int

Total number of walks (W lines) processed. Defaults to 0.

max_walk_rank : int

Maximum number of segments in any single walk. Defaults to 0.

sum_rank0_length : int

Sum of lengths of the first segments of all walks. Defaults to 0.

input_genome_size : int

Cumulative size of all paths (sum of segment lengths along each walk). Defaults to 0.

walks_info : List[Dict[str, Any]]

List of dictionaries, each containing info for a walk (Path name, Sequence length, Num Segments). Defaults to an empty list.

inverted_links_count : int

Count of links where orientations differ (e.g., S1+ -> S2-). Defaults to 0.

negative_links_count : int

Count of links where both segments have negative orientation (S1- -> S2-). Defaults to 0.

self_links_count : int

Count of links where a segment links to itself (S1 -> S1). Defaults to 0.

isolated_segments : Set[str]

Set of segment IDs that have no links connected to them. During parsing it is initialized with all segments, and linked segments are then removed. Defaults to an empty set.

shared_executor : Optional[ThreadPoolExecutor]

Executor for running synchronous tasks in threads. Auto-initialized.

progress : Optional[Progress]

Rich Progress instance for displaying progress. Auto-initialized.

line_type_counts : Counter

Counts of each GFA line type (H, S, L, W, P, C, E, U). Auto-initialized.

header_gfa_file : Optional[Path]

Path to where the GFA header is saved. Auto-initialized.

stats_file : Optional[Path]

Path to where GFA statistics are saved. Auto-initialized.

db_file_path : Optional[Path]

Path to the SQLite database file for links. Auto-initialized.

RE_ORIENTED_SEG_GT_LT : re.Pattern

Compiled regular expression for parsing oriented segments using ‘>’ and ‘<’. Auto-initialized.

RE_ORIENTED_SEG_PLUS_MINUS : re.Pattern

Compiled regular expression for parsing oriented segments using ‘+’ and ‘-’. Auto-initialized.
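
The two orientation notations mentioned above can be illustrated with a small sketch. The patterns below are hypothetical and written only for illustration; the expressions actually compiled by GraTools are not shown in this page and may differ. The helper name parse_oriented_segments is likewise an assumption.

```python
import re

# Hypothetical patterns for illustration; not taken from the GraTools source.
RE_GT_LT = re.compile(r"([><])([^><]+)")             # walk style, e.g. ">s1<s2"
RE_PLUS_MINUS = re.compile(r"([^,]+)([+-])(?:,|$)")  # path style, e.g. "s1+,s2-"


def parse_oriented_segments(text):
    """Return (segment_id, orientation) pairs with orientation as '+' or '-'."""
    if text.startswith((">", "<")):
        # '>' means forward and '<' means reverse in walk notation.
        return [(seg, "+" if sign == ">" else "-")
                for sign, seg in RE_GT_LT.findall(text)]
    return RE_PLUS_MINUS.findall(text)


walk_segments = parse_oriented_segments(">s1<s2>s3")
path_segments = parse_oriented_segments("s1+,s2-,s3+")
```

Normalizing both notations to the same pair format lets later steps treat walks and paths uniformly.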

Attributes Summary

bam_path

bam_segments_file

bed_path

db_file_path

db_links

found_minigraph

gfa_name

header_gfa_file

index_links

input_genome_size

inverted_links_count

link_count

max_walk_rank

negative_links_count

progress

sample_reference

segment_count

self_links_count

shared_executor

stats_file

sum_rank0_length

threads

total_segment_length

version

walks_count

works_path

Methods Summary

compute_statistics()

Computes various statistics about the parsed GFA graph, including segment counts, lengths, link properties, and connectivity (if link indexing is enabled).

parse_gfa()

Parses the GFA file line by line, processing headers, segments, links, and walks.

run()

Synchronous entry point to orchestrate GFA parsing and BED file sorting.

save_header()

Save the GFA header lines (H lines) to a text file.

save_samples_chrom()

Save sample-chromosome-fragment information to 'samples_chrom.txt'.

tag_bam()

Tags the generated BAM segments file with sample walk information (SW tag).

Attributes Documentation

bam_path: Path | None = None
bam_segments_file: Path | None = None
bed_path: Path | None = None
db_file_path: Path | None = None
found_minigraph: bool = False
gfa_name: str | None = None
header_gfa_file: Path | None = None
input_genome_size: int = 0
max_walk_rank: int = 0
progress: Progress | None = None
sample_reference: str | None = None
segment_count: int = 0
shared_executor: ThreadPoolExecutor | None = None
stats_file: Path | None = None
sum_rank0_length: int = 0
threads: int = 1
total_segment_length: int = 0
version: str | None = None
walks_count: int = 0
works_path: Path | None = None

Methods Documentation

compute_statistics()[source]

Computes various statistics about the parsed GFA graph, including segment counts, lengths, link properties, and connectivity (if link indexing is enabled). Saves these statistics to a file.

Returns

A dictionary containing all computed statistics, categorized.

Return type:

Dict[str, Any]
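
As a rough illustration of the link-related counters described in the attribute list (inverted, negative, and self links, degrees, isolated segments), the following sketch computes them from plain tuples. It is not the GraTools implementation: the function name link_statistics, the tuple layout, and the choice to count a self link twice in the degree are assumptions made for this example.

```python
from collections import defaultdict


def link_statistics(segment_ids, links):
    """Summarise links given as (from_id, from_orient, to_id, to_orient) tuples."""
    degrees = defaultdict(int)
    isolated = set(segment_ids)  # start with every segment, then remove linked ones
    inverted = negative = self_links = 0

    for from_id, from_orient, to_id, to_orient in links:
        # Assumption: each link end adds one to the degree of its segment,
        # so a self link adds two to the same segment.
        degrees[from_id] += 1
        degrees[to_id] += 1
        isolated.discard(from_id)
        isolated.discard(to_id)

        if from_orient != to_orient:
            inverted += 1      # orientations differ, e.g. S1+ -> S2-
        elif from_orient == "-":
            negative += 1      # both negative, e.g. S1- -> S2-
        if from_id == to_id:
            self_links += 1    # S1 -> S1

    return {
        "link_count": len(links),
        "inverted_links_count": inverted,
        "negative_links_count": negative,
        "self_links_count": self_links,
        "degrees": dict(degrees),
        "isolated_segments": isolated,
    }


stats = link_statistics(
    ["s1", "s2", "s3", "s4"],
    [("s1", "+", "s2", "+"), ("s1", "+", "s3", "-"),
     ("s2", "-", "s3", "-"), ("s3", "+", "s3", "+")],
)
```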

async parse_gfa()[source]

Parses the GFA file line by line, processing headers, segments, links, and walks. Segments are written to a BAM file. Links are (optionally) stored in an async SQLite DB. Walk information is used to generate data for BED files, written by AsyncBedWriter.

Return type:

None
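
The line-by-line pass can be pictured with a simplified, synchronous sketch that only counts line types and accepts both .gfa and .gfa.gz inputs. The helper names open_gfa and count_line_types are hypothetical; the real method is asynchronous and does considerably more work per line.

```python
import gzip
from collections import Counter
from pathlib import Path


def open_gfa(gfa_path):
    """Open a plain or gzip-compressed GFA file in text mode."""
    path = Path(gfa_path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def count_line_types(gfa_path):
    """Count records by their first tab-separated field (H, S, L, W, ...)."""
    counts = Counter()
    with open_gfa(gfa_path) as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                counts[line.split("\t", 1)[0].strip()] += 1
    return counts
```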

run()[source]

Synchronous entry point to orchestrate GFA parsing and BED file sorting. Sets up an asyncio event loop and runs the asynchronous parse_gfa method. Then, sorts the generated BED files using a ThreadPoolExecutor.
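
The pattern described here, an asynchronous step run to completion from synchronous code followed by thread-pooled post-processing, might look like the sketch below. Everything in it is a stand-in: write_bed_files replaces the real parsing step, sort_bed is a hypothetical sorter, and the file names and records are invented for the example.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


async def write_bed_files(out_dir):
    """Stand-in for the asynchronous parsing step: writes unsorted BED files."""
    records = {
        "sampleA.bed": ["chr1\t50\t60\ts3", "chr1\t0\t10\ts1"],
        "sampleB.bed": ["chr1\t20\t30\ts2", "chr1\t5\t15\ts1"],
    }
    paths = []
    for name, lines in records.items():
        path = out_dir / name
        path.write_text("\n".join(lines) + "\n")
        await asyncio.sleep(0)  # yield to the event loop, as real async I/O would
        paths.append(path)
    return paths


def sort_bed(path):
    """Sort one BED file in place by chromosome, start, and end."""
    rows = [line.split("\t") for line in path.read_text().splitlines()]
    rows.sort(key=lambda r: (r[0], int(r[1]), int(r[2])))
    path.write_text("".join("\t".join(r) + "\n" for r in rows))
    return path


def run(out_dir, threads=1):
    """Run the async step to completion, then sort the BED files in threads."""
    bed_paths = asyncio.run(write_bed_files(Path(out_dir)))
    with ThreadPoolExecutor(max_workers=threads) as executor:
        return list(executor.map(sort_bed, bed_paths))
```

Keeping the entry point synchronous means callers do not need to manage an event loop themselves.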

save_header()[source]

Save the GFA header lines (H lines) to a text file.

Return type:

None

save_samples_chrom()[source]

Save sample-chromosome-fragment information to ‘samples_chrom.txt’. This data is derived from GFA Walk (W) lines.

Return type:

None
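
To make the shape of the sample-chromosome-fragment data concrete, the sketch below builds a structure like dict_samples_chrom from Walk lines. It assumes the GFA 1.1 W-line field order (sample, haplotype index, sequence ID, start, end, walk), treats the sequence ID as the chromosome name, and stores fragments as tab-separated “start\tstop” strings. The function name collect_walk_fragments is hypothetical, and the layout of 'samples_chrom.txt' itself is not reproduced here.

```python
from collections import OrderedDict, defaultdict


def collect_walk_fragments(gfa_lines):
    """Map sample -> OrderedDict(chromosome -> list of 'start\\tstop' strings)."""
    samples_chrom = defaultdict(OrderedDict)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "W":
            continue  # only Walk lines carry this information
        # Assumed GFA 1.1 layout: W, sample, haplotype, sequence id, start, end, walk.
        _, sample, _haplotype, seq_id, start, end = fields[:6]
        samples_chrom[sample].setdefault(seq_id, []).append(f"{start}\t{end}")
    return samples_chrom


fragments = collect_walk_fragments([
    "S\ts1\tACGT",
    "W\tsampleA\t1\tchr1\t0\t6\t>s1>s2",
    "W\tsampleA\t1\tchr1\t10\t20\t>s3",
    "W\tsampleB\t1\tchr2\t0\t5\t<s1",
])
```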

tag_bam()[source]

Tags the generated BAM segments file with sample walk information (SW tag). This uses the GratoolsBam class to perform the tagging. The original BAM segments file is overwritten with the tagged version.

Return type:

None

Parameters: