GraTools Package
Developer API Reference and Internal Module Documentation
—
This page contains the auto-generated API documentation for GraTools. It is intended for developers who wish to contribute to the project or use GraTools as a Python library. All modules are documented with their respective members, functions, and inheritance.
—
Core Modules
This module handles the core pangenome graph structures and GFA parsing logic.
- class gratools.Graph.LinkInfo(seg_id_1, orient_seg_1, orient_key_seg_1, seg_id_2, orient_seg_2, orient_key_seg_2)
Bases: tuple
Information about a link between two segments.
Attributes
- seg_id_1 : str
Identifier of the first segment.
- orient_seg_1 : int
Orientation of the first segment (+1 or -1).
- orient_key_seg_1 : int
Orientation key for the first segment (often the same as orient_seg_1).
- seg_id_2 : str
Identifier of the second segment.
- orient_seg_2 : int
Orientation of the second segment (+1 or -1).
- orient_key_seg_2 : int
Orientation key for the second segment (often the same as orient_seg_2).
- orient_key_seg_1
Alias for field number 2
- orient_key_seg_2
Alias for field number 5
- orient_seg_1
Alias for field number 1
- orient_seg_2
Alias for field number 4
- seg_id_1
Alias for field number 0
- seg_id_2
Alias for field number 3
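The field aliases above follow the usual named-tuple layout, so every field can be read by name or by position. A minimal sketch, using a stand-in tuple declared locally with the documented field names (it is not imported from gratools, and the sample values are purely illustrative):

```python
from collections import namedtuple

# Stand-in with the documented field order; the real class lives in gratools.Graph.
LinkInfo = namedtuple(
    "LinkInfo",
    [
        "seg_id_1", "orient_seg_1", "orient_key_seg_1",
        "seg_id_2", "orient_seg_2", "orient_key_seg_2",
    ],
)

# Illustrative link: segment s1 (forward) to segment s2 (reverse).
link = LinkInfo("s1", 1, 1, "s2", -1, -1)

# Access by name and by position ("field number") is equivalent.
first_id = link.seg_id_1
second_orient = link[4]
```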
- class gratools.Graph.SubGraph(bam_path, bed_path, logger=<factory>, sample_name=None, sample_name_query=None, chromosome_query=None, start_query=None, stop_query=None, offset_first=0, offset_last=0, add_start_bases_first_segment=0, intersect_bed=None, segment_id_set=<factory>, segment_id_first_query=None, segment_id_first_strand=None, segment_id_last_query=None, segment_id_last_strand=None, segment_id_first=None, segment_id_last=None, works_path=None, merge=None, build_fasta_flag=False, gfa_walk_list=<factory>, gfa_link_list=<factory>, gfa_segment_list=<factory>, dict_segments_samples=<factory>, dict_segments_sequence=<factory>, sequences_list=<factory>, progress_dict=None, task_id=None, regions=None, intersected_results_by_regions=None)[source]
Bases: object
A class representing a subgraph in a genomic analysis pipeline. It handles operations such as BED file intersection, GFA walk/link/segment building, and FASTA sequence generation for a specific sample or region.
Attributes
- bam_path : Path
The file path to the BAM file.
- bed_path : Path
The file path to the BED file.
- logger : logging.Logger
The logger instance for logging messages. Defaults to a logger named “GraTools”.
- sample_name : Optional[str]
The name of the sample, derived from bed_path. Defaults to None.
- sample_name_query : Optional[str]
The name of the sample being queried. Defaults to None.
- chromosome_query : Optional[str]
The chromosome name for the query. Defaults to None.
- start_query : Optional[int]
The start position on the chromosome for the query. Defaults to None.
- stop_query : Optional[int]
The stop position on the chromosome for the query. Defaults to None.
- offset_first : int
The offset for the first segment in the query region. Defaults to 0.
- offset_last : int
The offset for the last segment in the query region. Defaults to 0.
- add_start_bases_first_segment : int
Additional start bases for the first segment (the exact usage is not documented here). Defaults to 0.
- intersect_bed : Optional[BedTool]
The BedTool object for intersected BED regions. Defaults to None.
- segment_id_set : Set[str]
A set storing unique segment IDs encountered during walk building.
- segment_id_first_query : Optional[str]
The ID of the first segment in the query region. Defaults to None.
- segment_id_first_strand : Optional[str]
The strand (‘+’ or ‘-’) of the first segment in the query. Defaults to None.
- segment_id_last_query : Optional[str]
The ID of the last segment in the query region. Defaults to None.
- segment_id_last_strand : Optional[str]
The strand (‘+’ or ‘-’) of the last segment in the query. Defaults to None.
- segment_id_first : Optional[str]
The first segment ID encountered for the current sample (may be the same as the query's). Defaults to None.
- segment_id_last : Optional[str]
The last segment ID encountered for the current sample (may be the same as the query's). Defaults to None.
- works_path : Optional[Path]
The working directory path. Defaults to None.
- merge : Optional[int]
Merge distance parameter for BED operations (e.g., bedtools merge -d). Defaults to None.
- build_fasta_flag : Optional[bool]
Flag indicating whether FASTA sequences should be built. Defaults to False.
- gfa_walk_list : List[str]
A list of GFA walk strings (W lines). Defaults to an empty list.
- gfa_link_list : List[str]
A list of GFA link strings (L lines). Defaults to an empty list.
- gfa_segment_list : List[str]
A list of GFA segment strings (S lines). Defaults to an empty list.
- dict_segments_samples : defaultdict[str, List[str]]
A dictionary mapping segment IDs to a list of sample identifiers. Defaults to an empty defaultdict.
- dict_segments_sequence : defaultdict[str, str]
A dictionary mapping segment IDs to their sequences. Defaults to an empty defaultdict.
- sequences_list : List[SeqRecord]
A list of Biopython SeqRecord objects for generated FASTA sequences. Defaults to an empty list.
- progress_dict : Optional[Dict]
A dictionary to track progress, typically for multi-processing. Defaults to None.
- task_id : Optional[TaskID]
The task ID for progress tracking with rich.progress. Defaults to None.
- regions : Optional[List[Dict[str, Any]]]
A list of regions (dictionaries with ‘chromosome’, ‘start’, ‘stop’). Defaults to None.
- intersected_results_by_regions : Optional[BedTool]
A BedTool object containing combined intersected results for all regions. Defaults to None.
- compute_intersection()[source]
Compute the intersection of BED regions for each region in self.regions. Store the results in self.intersected_results_by_regions.
- Return type:
None
- build_walks()[source]
Build GFA walks (W lines) and links (L lines) from the intersected BED regions.
- Return type:
None
- build_segments()[source]
Build GFA segments (S lines) from the BAM file for segments in self.segment_id_set.
- Return type:
None
- filter_bed_with_awk()[source]
Filter a BED file using an awk command line to extract only lines containing an ID of interest.
- Return type:
BedTool
- get_chr_pos(progress_dict, task_id)[source]
Identify chromosomal regions corresponding to segments in self.segment_id_set, then compute intersections, and finally build walks and segments for these regions.
Parameters
- progress_dict : Optional[Dict]
A dictionary to track progress for multiprocessing.
- task_id : Optional[TaskID]
The task ID for rich.progress tracking.
- Return type:
None
- build_fasta()[source]
Build FASTA sequences from the GFA walks (self.gfa_walk_list).
This method recovers FASTA sequences by processing each walk, extracting the constitutive segments, and applying specific offsets if the current sample is the query sample (self.sample_name == self.sample_name_query). The offsets trim segment sequences at the beginning or end of the walk to match the precise query region boundaries (self.start_query, self.stop_query). The resulting sequences are stored as SeqRecord objects in self.sequences_list.
- Return type:
None
- Parameters:
bam_path (Path)
bed_path (Path)
logger (Logger)
sample_name (str | None)
sample_name_query (str | None)
chromosome_query (str | None)
start_query (int | None)
stop_query (int | None)
offset_first (int)
offset_last (int)
add_start_bases_first_segment (int)
intersect_bed (BedTool | None)
segment_id_first_query (str | None)
segment_id_first_strand (str | None)
segment_id_last_query (str | None)
segment_id_last_strand (str | None)
segment_id_first (str | None)
segment_id_last (str | None)
works_path (Path | None)
merge (int | None)
build_fasta_flag (bool | None)
dict_segments_samples (defaultdict)
dict_segments_sequence (defaultdict)
sequences_list (List[SeqRecord])
progress_dict (Dict | None)
task_id (TaskID | None)
intersected_results_by_regions (BedTool | None)
- class gratools.Graph.AsyncGfaDatabase(db_file, timeout=30.0)[source]
Bases: object
Manage an asynchronous SQLite database for storing and querying GFA link data. It uses aiosqlite for non-blocking operations within an asyncio event loop and serializes writes via an internal FIFO queue to prevent SQLite lock contention.
Attributes
- db_file : Path
Path to the SQLite database file.
- timeout : float
Maximum timeout (in seconds) for SQLite lock acquisition.
- logger : logging.Logger
Logger instance.
- _conn : Optional[aiosqlite.Connection]
Shared SQLite connection (or None if not connected).
- _write_queue : asyncio.Queue
Asynchronous queue for batches of links to be inserted. Max size 100.
- _sql_task : Optional[asyncio.Task]
Background task consuming the queue and writing to the database.
- _shutdown : bool
Flag to signal shutdown to the writer task.
- __init__(db_file, timeout=30.0)[source]
Initialize the instance without opening the connection. The schema is created upon the first call to connect().
- Parameters:
db_file (Path) – Path to the SQLite file to be used as backend.
timeout (float, optional) – Maximum wait time for SQLite locks (default 30.0s).
- async connect()[source]
Connect to the SQLite db (if not already connected), configure PRAGMA settings, create the ‘links’ table schema (if it doesn’t exist), and start the SQL writer task. This method is idempotent: if already connected, it does nothing.
- Return type:
None
- async batch_insert_links(links)[source]
Enqueue a batch of links for non-blocking insertion. Must be called after await connect().
- async create_indexes()[source]
Create indexes on seg_id_1 and seg_id_2 to accelerate queries. Should ideally be called after all data insertions are complete.
- Return type:
None
- async query_links_by_segment(segment_id)[source]
Retrieves all links where segment_id appears as seg_id_1 or seg_id_2.
- async test_query_links(segment_id)[source]
Retrieve and categorize links related to a given segment:
“before”: links where segment_id is seg_id_2.
“after”: links where segment_id is seg_id_1.
- async find_children_and_grandchildren(node_id)[source]
Find direct successors (children) and second-degree successors (grandchildren) of a segment. A child is seg_id_2 where node_id is seg_id_1. A grandchild is a child of a child.
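The “before”/“after” and children/grandchildren definitions can be expressed as plain SQL over a two-column links table. The sketch below is only an illustration: it uses the synchronous standard-library sqlite3 module instead of aiosqlite, keeps only the seg_id_1 and seg_id_2 columns, returns neighbouring segment IDs rather than full link rows, and the function names are hypothetical. The queries actually issued by AsyncGfaDatabase may differ.

```python
import sqlite3

# In-memory table reduced to the two columns named in the descriptions above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (seg_id_1 TEXT, seg_id_2 TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")],
)


def before_and_after(conn, segment_id):
    """Neighbours reached through links ending at / starting from segment_id."""
    before = [row[0] for row in conn.execute(
        "SELECT seg_id_1 FROM links WHERE seg_id_2 = ? ORDER BY seg_id_1",
        (segment_id,),
    )]
    after = [row[0] for row in conn.execute(
        "SELECT seg_id_2 FROM links WHERE seg_id_1 = ? ORDER BY seg_id_2",
        (segment_id,),
    )]
    return {"before": before, "after": after}


def children_and_grandchildren(conn, node_id):
    """Direct successors and successors of those successors."""
    children = [row[0] for row in conn.execute(
        "SELECT DISTINCT seg_id_2 FROM links WHERE seg_id_1 = ? ORDER BY seg_id_2",
        (node_id,),
    )]
    # Self-join: l1 goes node -> child, l2 goes child -> grandchild.
    grandchildren = [row[0] for row in conn.execute(
        "SELECT DISTINCT l2.seg_id_2 FROM links AS l1 "
        "JOIN links AS l2 ON l2.seg_id_1 = l1.seg_id_2 "
        "WHERE l1.seg_id_1 = ? ORDER BY l2.seg_id_2",
        (node_id,),
    )]
    return children, grandchildren
```

Indexes on seg_id_1 and seg_id_2, as created by create_indexes(), are what keep both lookups and the self-join fast on large graphs.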
- async close()[source]
Properly shut down the database:
1. Signal the SQL writer task to stop.
2. Wait for the writer task to finish processing its queue (with timeout).
3. Cancel the task if it doesn’t finish in time.
4. Create indexes (important to do this after all writes).
5. Close the SQLite connection.
- Return type:
None
- class gratools.Graph.AsyncBedWriter(bed_dir, batch_size=1, progress=None)[source]
Bases: object
Asynchronous BED file writer using aiofiles. It buffers lines per sample and writes them in batches to separate BED files.
Attributes
- bed_dir : Path
Output directory for .bed files.
- batch_size : int
Number of lines to buffer per sample before an automatic flush.
- progress : Optional[Progress]
Rich Progress instance for displaying progress (if provided).
- logger : logging.Logger
Logger instance.
- _queue : asyncio.Queue[Tuple[str, List[str]]]
Internal queue for (sample_name, lines_to_write) tuples. Max size 100.
- _shutdown : bool
Flag to signal shutdown to the writer loop.
- _task : Optional[asyncio.Task]
The background asyncio task running the _writer_loop.
- start()[source]
Start the writer loop as a background task if not already running.
- Return type:
None
- async enqueue(sample, lines)[source]
Add a sample and its corresponding lines to the write queue. The method uses put_nowait, which assumes the queue rarely fills; awaiting self._queue.put() would be the alternative if applying backpressure to the producer is acceptable when the queue is full.
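The difference between the two enqueue strategies can be seen on a small bounded queue. This is a generic asyncio sketch, not the AsyncBedWriter implementation: the queue size, the None sentinel and the in-memory written list are assumptions made for the illustration.

```python
import asyncio


async def demo():
    queue = asyncio.Queue(maxsize=2)  # small bound to make the queue fill quickly
    written = []

    async def writer_loop():
        # Consume (sample, lines) tuples until a None sentinel arrives.
        while True:
            item = await queue.get()
            if item is None:
                break
            sample, lines = item
            written.append((sample, len(lines)))

    # No consumer is running yet, so the third put_nowait finds the queue full.
    queue.put_nowait(("s1", ["line"]))
    queue.put_nowait(("s2", ["line"]))
    try:
        queue.put_nowait(("s3", ["line"]))
        overflowed = False
    except asyncio.QueueFull:
        overflowed = True

    # With "await put", the producer simply waits until the writer frees a slot.
    task = asyncio.create_task(writer_loop())
    await queue.put(("s3", ["line"]))
    await queue.put(None)
    await task
    return overflowed, written


overflowed, written = asyncio.run(demo())
```

put_nowait keeps the producer from ever blocking but raises asyncio.QueueFull when the bound is reached, whereas awaiting put trades that exception for backpressure.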
- class gratools.Graph.GFA(gfa_path, threads=1, logger=<factory>, disable_progress_flag=False, gfa_name=None, version=None, header_gfa=<factory>, sample_reference=None, bam_segments_file=None, dict_samples_chrom=<factory>, dict_segments_size=<factory>, dict_segments_samples=<factory>, dict_samples_bed=<factory>, works_path=None, bed_path=None, bam_path=None, found_minigraph=False, index_links=False, db_links=None, segment_count=0, total_segment_length=0, link_count=0, degrees=<factory>, walks_count=0, max_walk_rank=0, sum_rank0_length=0, input_genome_size=0, walks_info=<factory>, inverted_links_count=0, negative_links_count=0, self_links_count=0, isolated_segments=<factory>, shared_executor=None, progress=None, line_type_counts=<factory>, header_gfa_file=None, stats_file=None, db_file_path=None)[source]
Bases: object
Manages parsing of a GFA (Graphical Fragment Assembly) file, computes statistics, and generates related files such as a BAM file containing the segments and per-sample BED files for the paths. It also feeds an asynchronous database for links and an asynchronous BED writer.
Attributes
- gfa_path : Path
Path to the input GFA file (can be .gfa or .gfa.gz).
- threads : int, optional
Number of threads for operations like BAM file processing. Defaults to 1.
- logger : logging.Logger
Logger object. Defaults to a logger named “GraTools”.
- gfa_name : Optional[str]
Name of the GFA file derived from gfa_path (without extensions). Auto-initialized.
- version : Optional[str]
GFA version extracted from the header (e.g., “1.0”). Auto-initialized.
- header_gfa : List[str]
List of header lines (H lines) from the GFA file. Auto-initialized.
- sample_reference : Optional[str]
Reference sample name, potentially from the GFA header (RS tag). Auto-initialized.
- bam_segments_file : Optional[Path]
Path to the BAM file where segments (S lines) will be written. Auto-initialized.
- dict_samples_chrom : defaultdict[str, OrderedDict[str, List[str]]]
Maps sample names to an OrderedDict of chromosome names, which in turn map to a list of “start\tstop” fragment strings derived from Walk (W) lines. Auto-initialized.
- dict_segments_size : defaultdict[str, int]
Maps segment IDs to their length (in base pairs). Auto-initialized.
- dict_segments_samples : defaultdict[str, List[str]]
Maps segment IDs to a list of sample identifiers (“sample;chromosome;haplotype”) that contain the segment. Auto-initialized.
- dict_samples_bed : defaultdict[str, OrderedDict[str, Path]]
Not directly populated by the current parsing logic. It was likely intended to track the paths of generated BED files, which AsyncBedWriter manages internally, so it may be redundant or reserved for post-processing tracking. Auto-initialized.
- works_path : Optional[Path]
Path to the working directory (e.g., “…/{gfa_name}_GraTools_INDEX”). Auto-initialized.
- bed_path : Optional[Path]
Path to the subdirectory for BED files within works_path. Auto-initialized.
- bam_path : Optional[Path]
Path to the subdirectory for BAM files within works_path. Auto-initialized.
- found_minigraph : bool
Flag indicating if a sample named ‘MINIGRAPH’ (case-insensitive) was found in Walk lines. Defaults to False.
- index_links : bool, optional
If True, GFA links (L lines) are stored in an SQLite database. Defaults to False, as shown in the class signature.
- db_links : Optional[AsyncGfaDatabase]
Asynchronous database handler for GFA links. Auto-initialized if index_links is True.
- segment_count : int
Total number of segments (S lines) processed. Defaults to 0.
- total_segment_length : int
Sum of lengths of all segments. Defaults to 0.
- link_count : int
Total number of links (L lines) processed. Defaults to 0.
- degrees : defaultdict[str, int]
Maps segment IDs to their degree (number of links connected). Defaults to an empty defaultdict.
- walks_count : int
Total number of walks (W lines) processed. Defaults to 0.
- max_walk_rank : int
Maximum number of segments in any single walk. Defaults to 0.
- sum_rank0_length : int
Sum of lengths of the first segments of all walks. Defaults to 0.
- input_genome_size : int
Cumulative size of all paths (sum of segment lengths along each walk). Defaults to 0.
- walks_info : List[Dict[str, Any]]
List of dictionaries, each containing info for a walk (Path name, Sequence length, Num Segments). Defaults to an empty list.
- inverted_links_count : int
Count of links where orientations differ (e.g., S1+ -> S2-). Defaults to 0.
- negative_links_count : int
Count of links where both segments have negative orientation (S1- -> S2-). Defaults to 0.
- self_links_count : int
Count of links where a segment links to itself (S1 -> S1). Defaults to 0.
- isolated_segments : Set[str]
Set of segment IDs that have no links connected to them. Initialized with all segments, then linked ones are removed. Defaults to an empty set.
- shared_executor : Optional[ThreadPoolExecutor]
Executor for running synchronous tasks in threads. Auto-initialized.
- progress : Optional[Progress]
Rich Progress instance for displaying progress. Auto-initialized.
- line_type_counts : Counter
Counts of each GFA line type (H, S, L, W, P, C, E, U). Auto-initialized.
- header_gfa_file : Optional[Path]
Path to where the GFA header is saved. Auto-initialized.
- stats_file : Optional[Path]
Path to where GFA statistics are saved. Auto-initialized.
- db_file_path : Optional[Path]
Path to the SQLite database file for links. Auto-initialized.
- RE_ORIENTED_SEG_GT_LT : re.Pattern
Compiled regular expression for parsing oriented segments using ‘>’ and ‘<’. Auto-initialized.
- RE_ORIENTED_SEG_PLUS_MINUS : re.Pattern
Compiled regular expression for parsing oriented segments using ‘+’ and ‘-’. Auto-initialized.
- disable_progress_flag : bool
If True, progress bars are disabled. Defaults to False.
- sort_file_in_place()[source]
Sort a file in place using Unix commands without creating a temporary copy. The file must be in the format: sample chromosome start end
- async parse_gfa()[source]
Parses the GFA file line by line, processing headers, segments, links, and walks. Segments are written to a BAM file. Links are (optionally) stored in an async SQLite DB. Walk information is used to generate data for BED files, written by AsyncBedWriter.
- Return type:
None
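To make the walk-handling step more concrete, here is a minimal sketch of how a single W line could be split into an identifier, a fragment and a list of oriented segments. The regular expression is a hypothetical stand-in (the actual RE_ORIENTED_SEG_GT_LT pattern is not shown on this page), the function name and the returned dictionary layout are illustrative, and numeric start/stop fields are assumed.

```python
import re

# Hypothetical pattern: an orientation sign followed by a segment name.
ORIENTED_SEG = re.compile(r"([><])([^><]+)")


def parse_walk_line(line):
    """Split a GFA W line into a walk identifier, fragment and oriented steps."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "W":
        raise ValueError("not a W line")
    sample, haplotype, chromosome, start, stop, walk = fields[1:7]
    # '>' is mapped to +1 and '<' to -1, matching the LinkInfo orientations.
    steps = [
        (seg_id, 1 if sign == ">" else -1)
        for sign, seg_id in ORIENTED_SEG.findall(walk)
    ]
    return {
        "walk_id": f"{sample};{chromosome};{haplotype}",
        "fragment": f"{start}\t{stop}",
        "steps": steps,
    }


parsed = parse_walk_line("W\tsampleA\t0\tchr1\t0\t12\t>s1>s2<s3")
```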
- run()[source]
Synchronous entry point to orchestrate GFA parsing and BED file sorting. Sets up an asyncio event loop and runs the asynchronous parse_gfa method. Then, sorts the generated BED files using a ThreadPoolExecutor.
- Parameters:
gfa_path (Path)
threads (int)
logger (Logger)
disable_progress_flag (bool | None)
gfa_name (str | None)
version (str | None)
sample_reference (str | None)
bam_segments_file (Path | None)
dict_samples_chrom (defaultdict)
dict_segments_size (defaultdict)
dict_segments_samples (defaultdict)
dict_samples_bed (defaultdict)
works_path (Path | None)
bed_path (Path | None)
bam_path (Path | None)
found_minigraph (bool)
index_links (bool)
db_links (AsyncGfaDatabase | None)
segment_count (int)
total_segment_length (int)
link_count (int)
degrees (defaultdict)
walks_count (int)
max_walk_rank (int)
sum_rank0_length (int)
input_genome_size (int)
inverted_links_count (int)
negative_links_count (int)
self_links_count (int)
isolated_segments (set)
shared_executor (ThreadPoolExecutor | None)
progress (Progress | None)
line_type_counts (Counter)
header_gfa_file (Path | None)
stats_file (Path | None)
db_file_path (Path | None)
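The link-related counters documented for this class (link_count, degrees, inverted_links_count, negative_links_count, self_links_count, isolated_segments) can be illustrated on a handful of S and L lines. This is a sketch based only on the attribute descriptions above. The function name is hypothetical, and counting a self-link twice in the degree is an assumption, not confirmed GraTools behaviour.

```python
from collections import defaultdict


def link_statistics(gfa_lines):
    """Collect simple link statistics from GFA S and L lines."""
    segments = set()
    degrees = defaultdict(int)
    stats = {"link_count": 0, "inverted": 0, "negative": 0, "self": 0}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments.add(fields[1])
        elif fields[0] == "L":
            seg_1, orient_1, seg_2, orient_2 = fields[1:5]
            stats["link_count"] += 1
            # Each link endpoint adds one to the degree of its segment.
            degrees[seg_1] += 1
            degrees[seg_2] += 1
            if orient_1 != orient_2:
                stats["inverted"] += 1  # e.g. S1+ -> S2-
            elif orient_1 == "-":
                stats["negative"] += 1  # S1- -> S2-
            if seg_1 == seg_2:
                stats["self"] += 1
    # Segments that never appear in a link are isolated.
    stats["isolated"] = segments - set(degrees)
    return stats, dict(degrees)


example = [
    "S\ts1\tACGT",
    "S\ts2\tGG",
    "S\ts3\tTTA",
    "S\ts4\tC",
    "L\ts1\t+\ts2\t+\t0M",
    "L\ts2\t+\ts3\t-\t0M",
    "L\ts3\t-\ts1\t-\t0M",
    "L\ts2\t+\ts2\t+\t0M",
]
stats, degrees = link_statistics(example)
```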
- class gratools.Graph.GratoolsBam(bam_path, threads=1, logger=<factory>, suffix=None, works_path=None, gfa_name=None, tagging=False, disable_progress_flag=False)[source]
Bases: object
Handles operations related to BAM files in the GraTools context, such as indexing, extracting segment information, tagging segments, and performing various analyses (core/dispensable ratio, depth statistics, etc.).
Attributes
- bam_path : Path
Path to the BAM file.
- threads : int, optional
Number of threads for BAM operations (e.g., reading, indexing). Defaults to 1.
- logger : logging.Logger
Logger instance. Defaults to a logger named “GraTools”.
- suffix : Optional[str], optional
Suffix to append to output filenames generated by analyses. Defaults to None.
- works_path : Optional[Path], optional
Working directory path for saving output files. Defaults to None (uses BAM parent dir).
- gfa_name : Optional[str], optional
Name of the associated GFA file (used for naming output files). Defaults to None.
- tagging : bool, optional
If True, indicates that operations might modify tags, potentially requiring re-indexing. Used by index_bam to decide if indexing is needed. Defaults to False.
- progress : Optional[Progress]
Rich Progress instance for displaying progress. Auto-initialized.
- disable_progress_flag : Optional[bool], optional
Flag to disable the progress bar. Defaults to False.
- index_bam()[source]
Indexes the BAM file using pysam.index if the index is missing or outdated. The index file will have a ‘.bai’ or ‘.crai’ extension depending on BAM format.
- Return type:
None
- build_segments(list_segments=None)[source]
Extracts specified segments from a BAM file and reconstructs their GFA S-line representation. Also populates dictionaries for segment samples and sequences.
Parameters
- list_segments : Optional[List[str]], optional
A list of segment IDs (query_name in BAM) to extract. If None or empty, this method might process all segments or return empty results, depending on intended behavior (the current pysam.view call implies it needs a list).
Returns
- Tuple[List[str], defaultdict[str, List[str]], defaultdict[str, str]]
1. gfa_s_lines_list: list of strings, each a GFA S-line.
2. dict_seg_samples: defaultdict mapping segment ID to a list of “sample;chrom;haplo” strings from the SW tag.
3. dict_seg_sequence: defaultdict mapping segment ID to its sequence.
- tag(dict_segments_samples, nb_segments)[source]
Adds or updates the ‘SW’ (Sample Walks) tag to segments in the BAM file. This version uses an integer-to-string mapping for walk paths to improve performance.
The SW tag stores a comma-separated list of “sample;chromosome;haplotype” strings indicating which walks/paths contain the segment. The original BAM file is overwritten with the tagged version.
Parameters
- dict_segments_samples : Dict[str, List[int]]
Dictionary mapping segment IDs (query_name) to a list of integer IDs representing the walk paths.
- nb_segments : int
The total number of segments in the BAM file.
Returns
- Path
The path to the (now tagged and re-indexed) BAM file.
Raises
- FileNotFoundError
If the input BAM file does not exist.
- Exception
If errors occur during BAM reading, writing, or renaming.
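The SW tag layout and the integer-to-string mapping described above can be sketched as follows. The helper names and the id_to_walk dictionary are hypothetical and chosen only for the illustration. The sketch does not touch a BAM file, and shows only how a tag value could be assembled and read back.

```python
def build_sw_value(walk_ids, id_to_walk):
    """Join the walk strings referenced by integer IDs into one SW value."""
    return ",".join(id_to_walk[walk_id] for walk_id in walk_ids)


def parse_sw_value(value):
    """Split an SW value into (sample, chromosome, haplotype) tuples."""
    return [tuple(entry.split(";")) for entry in value.split(",")]


# Hypothetical mapping from integer walk IDs to "sample;chromosome;haplotype".
id_to_walk = {0: "sampleA;chr1;0", 1: "sampleB;chr1;1", 2: "sampleA;chr2;0"}

sw_value = build_sw_value([0, 1, 2], id_to_walk)
walks = parse_sw_value(sw_value)

# Depth in the sense used by the analyses below: number of unique samples.
depth = len({sample for sample, _chrom, _hap in walks})
```

Storing small integers per segment and expanding them to strings only when the tag is written keeps the in-memory dictionary compact, which is presumably the performance gain the description refers to.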
- core_dispensable_ratio(nb_samples_gfa, input_as_number, shared_min_cutoff=1, specific_max_cutoff=None, filter_min_len=1)[source]
Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in. Saves results to a CSV file.
Parameters
- nb_samples_gfa : int
Total number of unique samples present in the GFA (used for percentage calculation).
- input_as_number : bool
If True, shared_min_cutoff and specific_max_cutoff are treated as absolute counts of samples. If False, they are treated as percentages of nb_samples_gfa.
- shared_min_cutoff : int, optional
Minimum number/percentage of samples a segment must be in to be considered “shared” (core). Defaults to 1.
- specific_max_cutoff : Optional[int], optional
Maximum number/percentage of samples a segment can be in to be considered “specific” (dispensable). If None, specific analysis might be skipped or use a default (e.g., 1 if input_as_number). Defaults to None.
- filter_min_len : int, optional
Minimum length (bp) for a segment to be included in the filtered analysis. Defaults to 1 (no length filter).
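How the two cutoffs interact with input_as_number can be sketched with a small classifier. The rule below (shared when depth is at least the minimum cutoff, specific when depth is at most the maximum cutoff, specific analysis skipped when the cutoff is None) is an assumption derived from the parameter descriptions, and classify_segment is a hypothetical helper, not part of GratoolsBam.

```python
def classify_segment(depth, nb_samples_gfa, input_as_number,
                     shared_min_cutoff=1, specific_max_cutoff=None):
    """Label a segment from the number of samples (depth) that contain it."""
    if input_as_number:
        # Cutoffs are absolute sample counts.
        is_shared = depth >= shared_min_cutoff
        is_specific = (specific_max_cutoff is not None
                       and depth <= specific_max_cutoff)
    else:
        # Cutoffs are percentages of nb_samples_gfa; integer arithmetic
        # avoids floating-point rounding at the boundaries.
        is_shared = depth * 100 >= shared_min_cutoff * nb_samples_gfa
        is_specific = (specific_max_cutoff is not None
                       and depth * 100 <= specific_max_cutoff * nb_samples_gfa)
    if is_shared:
        return "shared"
    if is_specific:
        return "specific"
    return "other"


# Ten samples, cutoffs given as percentages: shared >= 90 %, specific <= 10 %.
labels = [classify_segment(d, 10, False, 90, 10) for d in (9, 5, 1)]
```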
- depth_nodes_stat(nb_samples_gfa, filter_min_len=1)[source]
Calculates and displays statistics about segment depth (number of unique samples a segment is found in). Outputs results to console and a CSV file.
Parameters
- nb_samples_gfa : int
Total number of unique samples in the GFA, used for context if needed (not directly in calcs here).
- filter_min_len : int, optional
Minimum length (bp) for a segment to be included in the filtered depth analysis. Defaults to 1 (no effective length filter).
Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.
Parameters
- samples_list_A : List[str]
A list of sample names. A segment is “shared” if it is present in ALL samples in this list.
- samples_list_B : Optional[List[str]], optional
An optional second list of sample names. If provided, a segment is “specific” if it is shared by all in samples_list_A AND absent from ALL samples in this list.
- filter_min_len : Optional[int], optional
If set, only segments with a length greater than or equal to this value will be considered.
- output_csv : Optional[bool], optional
If True, the function will return sets of the shared and specific segment IDs.
Returns
- Tuple[Set[str], Set[str]]
1. A set of segment IDs that are shared by all samples in samples_list_A.
2. A set of segment IDs that are specific to samples_list_A relative to samples_list_B. (This set is a subset of the first one.)
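The shared/specific definitions reduce to set operations once each segment is associated with the set of samples that contain it. A minimal sketch under stated assumptions: segment_samples is a hypothetical pre-built mapping (in GraTools this information comes from the SW tags), the function name is illustrative, and the specific set is left empty when no second list is given.

```python
def shared_and_specific(segment_samples, samples_list_a, samples_list_b=None):
    """Return (shared, specific) segment ID sets for two sample groups."""
    group_a = set(samples_list_a)
    # Shared: every sample of group A contains the segment.
    shared = {
        seg for seg, samples in segment_samples.items() if group_a <= samples
    }
    if samples_list_b is None:
        return shared, set()
    group_b = set(samples_list_b)
    # Specific: shared by group A and absent from every sample of group B.
    specific = {seg for seg in shared if not (segment_samples[seg] & group_b)}
    return shared, specific


segment_samples = {
    "s1": {"A", "B", "C"},
    "s2": {"A", "B"},
    "s3": {"A"},
    "s4": {"B", "C"},
}
shared, specific = shared_and_specific(segment_samples, ["A", "B"], ["C"])
```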
- get_segments_and_positions_by_depth(total_gfa_samples, input_as_number, lower_bound_depth, upper_bound_depth, filter_min_len, bed_path=None)[source]
Finds segments within a specific depth range and retrieves their genomic positions from BED files.
This function performs two main steps:
1. Scans the BAM file to identify segments that meet the specified depth and length criteria.
2. For those segments, efficiently queries the relevant BED files to find their exact genomic coordinates (chromosome, start, end).
Parameters
- total_gfa_samples : int
Total number of unique samples in the GFA, used for percentage calculations.
- input_as_number : bool
If True, depth bounds are absolute counts; if False, they are percentages of total_gfa_samples.
- lower_bound_depth : int
Minimum sample depth (count or percentage) for a segment to be included.
- upper_bound_depth : int
Maximum sample depth (count or percentage) for a segment to be included.
- filter_min_len : int
Minimum length in base pairs for a segment to be considered.
- bed_path : Path, optional
Path to the directory containing the sample-specific BED files.
Returns
- Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]
A tuple containing two dictionaries:
1. segments_with_depth: {segment_id: depth} for all segments matching the criteria.
2. segment_locations: {segment_id: {sample_name: [(chrom, start, end), …]}}
- export_nodes_to_csv(output_csv_path, core_threshold_percent=0.95)[source]
Exports enhanced node information to a CSV file using an efficient, single-pass approach.
This method aggregates data in memory before creating a final Pandas DataFrame, making it much more memory-efficient than building a list of all records. It correctly parses the ‘SW’ tag to build a list of samples for each unique node.
Parameters
- output_csv_path : Path
The path where the output CSV file will be saved.
- core_threshold_percent : float, optional
The percentage of total samples above which a node is considered ‘core’. Defaults to 0.95.
Main application logic and high-level command orchestrations.
- gratools.Gratools.flatten(list_of_lists)[source]
Flattens a list of lists into a single list.
Parameters
- list_of_lists : List[List[Any]]
A list where each element is itself a list.
Returns
- List[Any]
A new list containing all items from the sublists.
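The behaviour described here, one level of flattening, matches what itertools.chain.from_iterable provides. The sketch below is an equivalent stand-alone function, not necessarily the implementation used in gratools.Gratools.

```python
from itertools import chain


def flatten(list_of_lists):
    """Return a new list with the items of every sublist, in order."""
    return list(chain.from_iterable(list_of_lists))


flat = flatten([["s1", "s2"], ["s3"], []])
```

Only one level is flattened: nested sublists inside the sublists are kept as they are.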
- class gratools.Gratools.Gratools(gfa_path, threads=1, outdir=None, logger=None, gfa_name=None, bam_segments_file=None, dict_samples_chrom=<factory>, works_path=None, bed_path=None, bam_path=None, samples_chrom_path=None, dict_gfa_graph_object=<factory>, sample_name_query=None, chromosome_query=None, start_query=0, stop_query=None, suffix=None, build_fasta_flag=False, merge=None, meta=<factory>, index_links=False, debug=False, disable_progress_flag=False)[source]
Bases: object
Main class for the GraTools toolkit, orchestrating GFA file processing, subgraph extraction, and various analyses on genomic graph data.
It handles GFA indexing (delegating to the GFA class), manages input parameters, and provides an interface for operations like subgraph extraction, FASTA generation, and statistical analysis of graph components.
Attributes
- gfa_path : Path
Path to the input GFA file.
- threads : int, optional
Number of threads for parallelizable operations. Defaults to 1.
- outdir : Optional[Path], optional
Output directory for GraTools results. If None, defaults to a directory named GraTools-output_{gfa_name} in the same directory as gfa_path.
- logger : Optional[logging.Logger]
Logger instance. Auto-configured in __post_init__.
- gfa_name : Optional[str]
Name of the GFA file, derived from gfa_path without extensions. Auto-initialized.
- bam_segments_file : Optional[Path]
Path to the BAM file containing GFA segments, located within the index directory. Auto-initialized.
- dict_samples_chrom : defaultdict[str, OrderedDict[str, List[Tuple[str, str]]]]
Maps sample names to an OrderedDict of chromosome names, which maps to a list of (start_fragment, stop_fragment) string tuples. Populated from samples_chrom.txt.
- works_path : Optional[Path]
Path to the main GraTools output directory for the current run (e.g., outdir/GraTools-output_{gfa_name}). Auto-initialized.
- bed_path : Optional[Path]
Path to the BED files subdirectory within the GFA index directory. Auto-initialized.
- bam_path : Optional[Path]
Path to the BAM files subdirectory within the GFA index directory. Auto-initialized.
- samples_chrom_path : Optional[Path]
Path to the samples_chrom.txt file within the GFA index directory. Auto-initialized.
- dict_gfa_graph_object : Dict[str, SubGraph]
Dictionary mapping sample names to their corresponding SubGraph objects after extraction. Defaults to an empty dict.
- sample_name_query : Optional[str]
Name of the primary sample for query operations (e.g., subgraph extraction). Defaults to None.
- chromosome_query : Optional[str]
Chromosome identifier for query operations. Defaults to None.
- start_query : int
Start position for query operations (0-based). Defaults to 0.
- stop_query : Optional[int]
Stop position for query operations. If None, might be inferred as chromosome end. Defaults to None.
- suffix : Optional[str]
Custom suffix for output files. If None, a default suffix based on query parameters is generated. Auto-initialized.
- build_fasta_flag : bool
Flag to enable FASTA file generation during subgraph extraction. Defaults to False.
- gzip_gfa : bool
Flag indicating if the input GFA file is gzipped. Auto-detected.
- merge : Optional[int]
Merge distance (-d for bedtools merge) for BED region processing. If set to -1 while a query region is defined, 10% of the query region size is used. Defaults to None.
- meta : Dict[str, Any]
Dictionary for meta-parameters like verbosity, log_path, threads, passed from CLI or config. Defaults to an empty dict.
- index_links : bool
Flag to control whether GFA links are indexed into a database during GFA parsing. Defaults to False, as shown in the class signature.
- debug : bool
Flag to enable debug mode, typically for more verbose logging or error details. Defaults to False.
- index_path : Optional[Path]
Path to the GFA index directory ({gfa_name}_GraTools_INDEX). Auto-initialized.
- header_gfa_file : Optional[Path]
Path to the saved GFA header file within the index directory. Auto-initialized.
- stats_gfa_file : Optional[Path]
Path to the saved GFA statistics file within the index directory. Auto-initialized.
- sub_graph_query : Optional[SubGraph]
SubGraph object for the primary query sample. Initialized in extract_sub_graph.
- _cached_chromosome_data : Optional[pd.DataFrame]
Internal cache for data read from samples_chrom_path to avoid redundant parsing.
- disable_progress_flag : Optional[bool]
Flag to control progress bar visibility. Defaults to False.
- get_gfa_statistics_df()[source]
Loads GFA statistics from the pre-computed statistics file into a pandas DataFrame.
- Returns:
Optional[pd.DataFrame] – DataFrame with GFA statistics, or None if file not found/readable.
- Return type:
DataFrame | None
- save_gfa_statistics()[source]
Saves the computed GFA statistics to a file.
- Return type:
None
- display_gfa_statistics(by_category=False)[source]
Displays GFA statistics in a Rich Table, either categorized or as a single table.
- Parameters:
by_category (bool, optional) – If True, display stats in separate tables per category. If False, display in a single comprehensive table. Defaults to False.
- Return type:
None
- get_chromosome_size(sample_name, chromosome_name)[source]
Gets the maximum end position (size) of a given chromosome for a specific sample. This represents the extent of the chromosome as defined by walk fragments in the GFA.
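The size lookup described above amounts to taking the largest end coordinate among the walk fragments of one sample and chromosome. A minimal sketch, assuming fragments are available as plain (sample, chromosome, start, end) tuples; the function name and the record layout are hypothetical and not part of the GraTools API:

```python
from typing import Iterable, Optional, Tuple

# Hypothetical fragment record: (sample, chromosome, start, end).
Fragment = Tuple[str, str, int, int]


def chromosome_size(
    fragments: Iterable[Fragment], sample_name: str, chromosome_name: str
) -> Optional[int]:
    """Return the largest end coordinate among matching fragments, or None."""
    ends = [
        end
        for sample, chrom, _start, end in fragments
        if sample == sample_name and chrom == chromosome_name
    ]
    return max(ends) if ends else None


# Illustrative data only.
fragments = [
    ("sampleA", "chr1", 0, 1200),
    ("sampleA", "chr1", 1500, 4800),
    ("sampleA", "chr2", 0, 900),
    ("sampleB", "chr1", 0, 5100),
]
size_a_chr1 = chromosome_size(fragments, "sampleA", "chr1")
size_missing = chromosome_size(fragments, "sampleC", "chr1")
```

Returning None for an unknown sample or chromosome is a choice made for this sketch; the actual method may behave differently.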
- property available_sample_names: List[str]
Retrieves a sorted list of unique sample names present in the GFA data.
- Returns:
List[str] – Sorted list of unique sample names.
- display_available_sample_names()[source]
Displays available sample names in a Rich Table.
- Return type:
None
- save_available_sample_names()[source]
Saves the list of available sample names to a CSV file.
- Return type:
None
- get_chromosomes_summary_by_sample_df()[source]
Generates a DataFrame summarizing chromosomes per sample. Includes sample name, a comma-separated list of unique chromosome names, and the count of unique chromosomes for that sample.
- Returns:
Optional[pd.DataFrame] – DataFrame with columns [“SAMPLES”, “CHROMOSOMES_LIST”, “NUM_UNIQUE_CHROMOSOMES”], or None if data cannot be loaded/processed.
- Return type:
DataFrame | None
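The summary described above can be pictured as a group-by on the sample name. The sketch below uses only the standard library and returns a list of row dictionaries instead of a pandas DataFrame; the function name and the input layout are assumptions, only the three column names come from the documentation:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical fragment record: (sample, chromosome, start, end).
Fragment = Tuple[str, str, int, int]


def summarize_chromosomes(fragments: Iterable[Fragment]) -> List[Dict[str, object]]:
    """Build one summary row per sample, sorted by sample name."""
    chroms_by_sample: Dict[str, Set[str]] = defaultdict(set)
    for sample, chrom, _start, _end in fragments:
        chroms_by_sample[sample].add(chrom)
    return [
        {
            "SAMPLES": sample,
            "CHROMOSOMES_LIST": ",".join(sorted(chroms)),
            "NUM_UNIQUE_CHROMOSOMES": len(chroms),
        }
        for sample, chroms in sorted(chroms_by_sample.items())
    ]


# Illustrative data only.
summary_rows = summarize_chromosomes(
    [
        ("sampleB", "chr1", 0, 5100),
        ("sampleA", "chr2", 0, 900),
        ("sampleA", "chr1", 0, 1200),
        ("sampleA", "chr1", 1500, 4800),
    ]
)
```

With pandas available, the same rows could be passed to a DataFrame constructor to obtain the documented column layout.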
- save_chromosomes_summary_by_sample()[source]
Saves the chromosome summary per sample to a CSV file.
- Return type:
None
- save_full_chromosome_fragment_data()[source]
Saves the raw chromosome fragment data (sample, chrom, start, end) to a CSV file.
- Return type:
None
- display_chromosomes_summary()[source]
Displays the chromosome summary per sample in a Rich Table.
- Return type:
None
- display_full_chromosome_fragment_data()[source]
Displays the full chromosome fragment data in a Rich Table, grouped by sample.
- Return type:
None
- extract_subgraph(samples_list_path=None, all_samples_flag=False)[source]
Extracts a subgraph based on a query region and optionally for other specified samples. Manages SubGraph object creation, processing, and GFA/FASTA file generation.
- Parameters:
samples_list_path (Optional[Path]) – Path to a file containing a list of additional samples to process (one per line).
all_samples_flag (bool) – If True and samples_list_path is not given, process all samples found in the GFA (relative to the query region).
- Return type:
None
- concatenate_and_generate_subgfa_file()[source]
Concatenates GFA components (Header, Segments, Links, Walks) from all processed SubGraph objects and writes them to a combined GFA file (gzipped).
- Return type:
None
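As an illustration of the concatenation step, the sketch below writes header, segment, link, and walk lines in that order to a gzipped file. The function name is hypothetical, and skipping identical lines (for example a segment shared by two subgraphs) is an assumption of this sketch, not documented behavior:

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterable


def write_combined_gfa(
    out_path: Path,
    headers: Iterable[str],
    segments: Iterable[str],
    links: Iterable[str],
    walks: Iterable[str],
) -> int:
    """Write H, S, L and W lines in that order to a gzipped GFA file.

    Returns the number of lines written.
    """
    written = 0
    seen = set()  # assumption: identical lines are written only once
    with gzip.open(out_path, "wt", encoding="utf-8") as handle:
        for group in (headers, segments, links, walks):
            for line in group:
                line = line.rstrip("\n")
                if line in seen:
                    continue
                seen.add(line)
                handle.write(line + "\n")
                written += 1
    return written


# Illustrative components from two subgraphs that share segment s1.
headers = ["H\tVN:Z:1.1"]
segments = ["S\ts1\tACGT", "S\ts2\tGG", "S\ts1\tACGT"]
links = ["L\ts1\t+\ts2\t+\t0M"]
walks = ["W\tsampleA\t0\tchr1\t0\t6\t>s1>s2"]

with tempfile.TemporaryDirectory() as tmp:
    gfa_file = Path(tmp) / "combined.gfa.gz"
    lines_written = write_combined_gfa(gfa_file, headers, segments, links, walks)
    with gzip.open(gfa_file, "rt", encoding="utf-8") as handle:
        round_trip = handle.read().splitlines()
```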
- generate_combined_fasta_file()[source]
Generates a combined FASTA file from sequences collected in all processed SubGraph objects.
- Return type:
None
- run_core_dispensable_ratio_analysis(input_as_number, shared_min, specific_max, filter_len)[source]
Runs core/dispensable segment ratio analysis using GratoolsBam. Parameters mirror those of GratoolsBam.core_dispensable_ratio.
- run_depth_nodes_statistics(filter_len)[source]
Runs node depth statistics analysis using GratoolsBam.
- Parameters:
filter_len (int)
- Return type:
None
- run_get_specific_groups_sample_analysis(sample_list_a_path, sample_list_b_path, filter_len, output_csv)[source]
Runs get_specific_groups_sample and saves the result to a file.
- Parameters:
sample_list_a_path (Path | None) – Path to the list of samples to check for shared segments.
sample_list_b_path (Path | None) – Path to the list of samples to check for specific segments.
filter_len (int, optional) – Minimum length of segments to be considered.
output_csv (bool) – If True, write the segments to a CSV file; if False, only print statistics.
- Return type:
None
- find_specific_groups_sample_position(segment_list_shared, sample_list_a=None)[source]
Finds segment positions using a streaming approach to balance RAM and I/O performance. It reads awk’s output line by line without loading the full result into memory or writing intermediate filtered BED files to disk.
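The streaming approach mentioned above can be sketched with subprocess.Popen: the child's standard output is consumed one line at a time, so neither the full output nor an intermediate file is held. The helper name is hypothetical, and a small Python child process stands in for the awk command so that the example runs anywhere; the real awk invocation is not shown in this reference.

```python
import subprocess
import sys
from typing import Dict, Iterator, List, Tuple


def stream_lines(command: List[str]) -> Iterator[str]:
    """Yield the command's stdout line by line without buffering it all."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, command)


# Stand-in for an awk filter: prints three BED-like lines.
demo_command = [
    sys.executable,
    "-c",
    "print('chr1\\t0\\t10\\ts1\\nchr1\\t10\\t25\\ts2\\nchr1\\t25\\t40\\ts3')",
]

wanted = {"s1", "s3"}  # illustrative segment IDs of interest
positions: Dict[str, Tuple[str, int, int]] = {}
for bed_line in stream_lines(demo_command):
    chrom, start, end, seg_id = bed_line.split("\t")
    if seg_id in wanted:
        positions[seg_id] = (chrom, int(start), int(end))
```

Only the matching records are kept in memory, which is the trade-off the description refers to.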
- get_segments_by_depth(input_as_number, lower_bound, upper_bound, filter_len)[source]
Retrieves segments within a specific depth range, using GratoolsBam. Returns a dictionary of {segment_id: depth}.
- display_or_save_segments_by_depth(input_as_number, lower_bound, upper_bound, filter_len, output_to_file)[source]
Retrieves segments by depth and either displays them in a Rich Table (if output_to_file is False) or saves them to a CSV file (if output_to_file is True).
- export_to_bandage_csv(output_csv_path=None)[source]
Exports node (segment) information to a CSV file compatible with Bandage. This method uses the indexed BAM file to calculate properties like length and depth.
- Parameters:
output_csv_path (Optional[Path]) – The path to save the output CSV file. If None, a default path is generated in the GFA directory.
- Return type:
None
- Parameters:
gfa_path (Path)
threads (int)
outdir (Path | None)
logger (Logger | None)
gfa_name (str | None)
bam_segments_file (Path | None)
dict_samples_chrom (defaultdict)
works_path (Path | None)
bed_path (Path | None)
bam_path (Path | None)
samples_chrom_path (Path | None)
sample_name_query (str | None)
chromosome_query (str | None)
start_query (int)
stop_query (int | None)
suffix (str | None)
build_fasta_flag (bool)
merge (int | None)
index_links (bool)
debug (bool)
disable_progress_flag (bool)
—
System & Utilities
Entry point for the Command Line Interface (CLI).
General purpose helper functions and genomic utilities.
- gratools.useful_function.reverse_complement_string(s)[source]
Quickly computes the reverse complement of a DNA string.
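One common way to compute a reverse complement quickly in Python is a translation table followed by a slice reversal. The sketch below illustrates that technique; it is not necessarily how reverse_complement_string is implemented, and the handling of N and lowercase bases is an assumption:

```python
# Translation table covering upper- and lowercase bases plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


example = reverse_complement("ATGCN")
```

Characters outside the table are left unchanged by str.translate, so unexpected symbols pass through rather than raising an error.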
- class gratools.useful_function.CustomCommand(*args, **kwargs)[source]
Bases:
Command
Custom Command class that applies global context settings (for help formatting) and prepends a specific header to the help message of individual commands.
- __init__(*args, **kwargs)[source]
Initializes the CustomCommand.
Ensures that the predefined CONTEXT_SETTINGS are applied by default to this command.
- get_help(ctx)[source]
Overrides the default help generation to prepend a custom header.
The header is printed directly to the console using shared_console before the standard help text is generated and returned.
- Parameters:
ctx (click.Context) – The current Click context.
- Returns:
str – The formatted help text, including the prepended header.
- Return type:
str
- class gratools.useful_function.CustomGroup(*args, **kwargs)[source]
Bases:
Group
Custom Group class that applies global context settings, prepends a header to its own help message, and ensures that all subcommands added via its command decorator use CustomCommand by default.
- __init__(*args, **kwargs)[source]
Initializes the CustomGroup.
Ensures that the predefined CONTEXT_SETTINGS are applied by default to this group and its subcommands (if they don’t override).
- command(*args, **kwargs)[source]
Overrides the default command decorator registration.
This ensures that any command registered using this group’s command method will automatically use CustomCommand as its class, thereby inheriting the custom help formatting and header.
- Parameters:
*args – Positional arguments for the command decorator.
**kwargs – Keyword arguments for the command decorator.
- Returns:
Callable – The decorator that registers the command.
- gratools.useful_function.validate_percentage_or_int(ctx, param, value)[source]
Click callback for validating that an option’s value is either an integer or a float representing a percentage (between 0.0 and 1.0 inclusive).
This function is intended to be used as a callback for a Click option.
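- Parameters:
ctx (click.Context) – The current Click context.
param (click.Parameter) – The option being validated.
value – The raw value supplied for the option.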
- Returns:
int | float – The validated value, converted to int if it’s a whole number, or float if it’s a percentage (0.0-1.0).
- Raises:
click.BadParameter – If the value is not a valid integer or a float between 0.0 and 1.0.
- Return type:
int | float
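The validation rule can be sketched without Click as a plain function. The name parse_percentage_or_int is hypothetical, and the order of the checks (try int first, then a float in the range 0.0 to 1.0) is an assumption of this sketch; inside a real Click callback the ValueError would be replaced by click.BadParameter.

```python
from typing import Union


def parse_percentage_or_int(value: str) -> Union[int, float]:
    """Return an int for whole numbers, or a float between 0.0 and 1.0."""
    try:
        # Assumption: anything that parses as an integer is treated as a count.
        return int(value)
    except ValueError:
        pass
    try:
        number = float(value)
    except ValueError:
        raise ValueError(f"{value!r} is neither an integer nor a float") from None
    if not 0.0 <= number <= 1.0:
        raise ValueError(f"{value!r} is not a percentage between 0.0 and 1.0")
    return number


as_count = parse_percentage_or_int("3")
as_ratio = parse_percentage_or_int("0.25")
```

Under this ordering the string "1" yields the integer 1 while "1.0" yields the float 1.0; the actual callback may resolve that boundary case differently.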
Configuration for the specialized logging system (Rich-based).
- class gratools.logger_config.ThreadedRichHandler(*args, **kwargs)[source]
Bases:
RichHandler
Custom RichHandler that processes log records in a separate thread to prevent blocking the main application thread, especially during I/O-bound logging operations (like writing to console or complex formatting).
It also includes a feature to trigger a program exit if a log record of ERROR level or higher is emitted through it.
Attributes
- runningbool
A flag indicating whether the log processing worker thread should continue running.
- log_queuequeue.Queue[logging.LogRecord]
A thread-safe queue used to buffer log records before they are processed by the worker thread.
- worker_threadthreading.Thread
The background thread responsible for consuming log records from log_queue.
- critical_error_occurredbool
A flag set to True if an ERROR or CRITICAL log has been processed, leading to program termination.
- __init__(*args, **kwargs)[source]
Initializes the ThreadedRichHandler.
Sets up the log queue and starts the background worker thread.
- Return type:
None
- emit(record)[source]
Queues a log record for processing by the worker thread.
If the log record’s level is ERROR or higher, this method sets a flag to indicate a critical error and initiates the process to stop the log processing and exit the application.
Parameters
- recordlogging.LogRecord
The log record to be emitted.
- Parameters:
record (LogRecord)
- Return type:
None
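The queue-and-worker pattern described for this handler can be sketched with the standard library alone. The class below is a hypothetical stand-in built on logging.Handler rather than RichHandler; it forwards records to a wrapped handler from a background thread, and it only records that an ERROR was seen instead of exiting the program.

```python
import logging
import queue
import threading
from typing import List


class ThreadedHandlerSketch(logging.Handler):
    """Queue records in emit() and process them on a worker thread."""

    def __init__(self, target: logging.Handler) -> None:
        super().__init__()
        self.target = target
        self.log_queue: queue.Queue = queue.Queue()
        self.critical_error_occurred = False
        self.worker_thread = threading.Thread(target=self._worker, daemon=True)
        self.worker_thread.start()

    def emit(self, record: logging.LogRecord) -> None:
        if record.levelno >= logging.ERROR:
            # The documented handler triggers an exit; this sketch only flags it.
            self.critical_error_occurred = True
        self.log_queue.put(record)

    def _worker(self) -> None:
        while True:
            record = self.log_queue.get()
            if record is None:  # sentinel used by close()
                break
            self.target.handle(record)

    def close(self) -> None:
        self.log_queue.put(None)
        self.worker_thread.join()
        super().close()


class ListHandler(logging.Handler):
    """Collects formatted messages so the demo can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


collector = ListHandler()
threaded = ThreadedHandlerSketch(collector)
demo_logger = logging.getLogger("threaded_sketch_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(threaded)
demo_logger.info("parsing started")
demo_logger.error("something failed")
demo_logger.removeHandler(threaded)
threaded.close()  # waits until the worker has drained the queue
```

The standard library also ships logging.handlers.QueueHandler and QueueListener, which implement a similar split between producers and a consumer thread.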
- gratools.logger_config.configure_logger(name, log_dir_path, verbosity_level, file_suffix='')[source]
Configures and returns a logger instance with specified settings.
The logger will output to:
1. The console (via RichHandler for formatted, colorful output).
2. A general log file (e.g., ‘name_log.o’).
3. An error log file for WARNING and higher messages (e.g., ‘name_log.e’).
Parameters
- namestr
The name for the logger (e.g., “GraTools”).
- log_dir_pathPath
The directory path where log files will be stored.
- verbosity_levelstr
The logging verbosity level (e.g., “DEBUG”, “INFO”, “ERROR”). This sets the minimum level for messages to be processed by the logger.
- file_suffixstr, optional
An optional suffix to append to the base name of log files. Defaults to an empty string.
Returns
- logging.Logger
The configured logger instance.
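A rough standard-library equivalent of this three-output setup is shown below. It is a sketch under assumptions: logging.StreamHandler stands in for RichHandler, the function name is hypothetical, and the exact file naming (suffix placed before the .o and .e extensions) is inferred from the examples rather than confirmed.

```python
import logging
import tempfile
from pathlib import Path


def configure_logger_sketch(
    name: str, log_dir_path: Path, verbosity_level: str, file_suffix: str = ""
) -> logging.Logger:
    """Attach a console handler, a general log file and a WARNING+ log file."""
    log_dir_path.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(verbosity_level)  # accepts level names such as "INFO"
    logger.propagate = False

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    console = logging.StreamHandler()
    general = logging.FileHandler(log_dir_path / f"{name}_log{file_suffix}.o")
    errors = logging.FileHandler(log_dir_path / f"{name}_log{file_suffix}.e")
    errors.setLevel(logging.WARNING)  # only WARNING and above reach this file

    for handler in (console, general, errors):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


with tempfile.TemporaryDirectory() as tmp:
    demo_logger = configure_logger_sketch("demo", Path(tmp), "INFO")
    demo_logger.info("index built")
    demo_logger.warning("missing tag")
    # Close the handlers so the files are flushed before reading them back.
    for handler in list(demo_logger.handlers):
        handler.close()
        demo_logger.removeHandler(handler)
    general_text = (Path(tmp) / "demo_log.o").read_text()
    error_text = (Path(tmp) / "demo_log.e").read_text()
```

The per-handler level on the error file is what separates the two files: both receive every record the logger accepts, but the second one filters out anything below WARNING.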
- gratools.logger_config.update_logger_file_suffix(logger, new_file_suffix)[source]
Updates file handlers of a logger to use a new file suffix. Existing log content from the old file (if any) is copied to the new file location before switching. The old file is then deleted.
This is useful if, for instance, a specific operation (like a query) should have its logs in a uniquely named file determined mid-execution.
Parameters
- loggerlogging.Logger
The logger instance whose file handlers need updating.
- new_file_suffixstr
The new suffix to be incorporated into the log filenames. Example: if the old file was “main_log.log” and new_file_suffix is “_query123”, the new file becomes “main_log_query123.log”.
Returns
- logging.Logger
The same logger instance, now with updated file handlers.
—
Package Contents
Main package module for gratools.
Contains core classes and functions for GraTools.
- class gratools.AsyncGfaDatabase(db_file, timeout=30.0)[source]
Bases:
object
Manage an asynchronous SQLite database for storing and querying GFA link data. It uses aiosqlite for non-blocking operations within an asyncio event loop and serializes writes via an internal FIFO queue to prevent SQLite lock contention.
Attributes
- db_filePath
Path to the SQLite database file.
- timeoutfloat
Maximum timeout (in seconds) for SQLite lock acquisition.
- loggerlogging.Logger
Logger instance.
- _connOptional[aiosqlite.Connection]
Shared SQLite connection (or None if not connected).
- _write_queueasyncio.Queue
Asynchronous queue for batches of links to be inserted. Max size 100.
- _sql_taskOptional[asyncio.Task]
Background task consuming the queue and writing to the database.
- _shutdownbool
Flag to signal shutdown to the writer task.
- __init__(db_file, timeout=30.0)[source]
Initialize the instance without opening the connection. The schema is created upon the first call to connect().
- Parameters:
db_file (Path) – Path to the SQLite file to be used as backend.
timeout (float, optional) – Maximum wait time for SQLite locks (default 30.0s).
- async connect()[source]
Connect to the SQLite db (if not already connected), configure PRAGMA settings, create the ‘links’ table schema (if it doesn’t exist), and start the SQL writer task. This method is idempotent: if already connected, it does nothing.
- Return type:
None
- async batch_insert_links(links)[source]
Enqueue a batch of links for non-blocking insertion. Must be called after await connect().
- async create_indexes()[source]
Create indexes on seg_id_1 and seg_id_2 to accelerate queries. Should ideally be called after all data insertions are complete.
- Return type:
None
- async query_links_by_segment(segment_id)[source]
Retrieves all links where segment_id appears as seg_id_1 or seg_id_2.
- async test_query_links(segment_id)[source]
Retrieve and categorize links related to a given segment.
- “before”: links where segment_id is seg_id_2.
- “after”: links where segment_id is seg_id_1.
- async find_children_and_grandchildren(node_id)[source]
Find direct successors (children) and second-degree successors (grandchildren) of a segment. A child is seg_id_2 where node_id is seg_id_1. A grandchild is a child of a child.
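The child and grandchild definitions above map naturally onto a lookup plus a self-join on the links table. The sketch below uses the synchronous sqlite3 module rather than aiosqlite; the function name, the orientation column names, and the sorted, de-duplicated output are assumptions, while the seg_id_1 and seg_id_2 columns follow the documentation.

```python
import sqlite3
from typing import List, Tuple


def children_and_grandchildren(
    conn: sqlite3.Connection, node_id: str
) -> Tuple[List[str], List[str]]:
    """Return (children, grandchildren) of node_id as sorted, unique lists."""
    children = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT seg_id_2 FROM links "
            "WHERE seg_id_1 = ? ORDER BY seg_id_2",
            (node_id,),
        )
    ]
    grandchildren = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT grand.seg_id_2 "
            "FROM links AS child "
            "JOIN links AS grand ON grand.seg_id_1 = child.seg_id_2 "
            "WHERE child.seg_id_1 = ? ORDER BY grand.seg_id_2",
            (node_id,),
        )
    ]
    return children, grandchildren


# Illustrative in-memory table with a small diamond-shaped graph.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE links (seg_id_1 TEXT, orient_seg_1 INTEGER, "
    "seg_id_2 TEXT, orient_seg_2 INTEGER)"
)
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?, ?)",
    [
        ("s1", 1, "s2", 1),
        ("s1", 1, "s3", 1),
        ("s2", 1, "s4", 1),
        ("s3", 1, "s4", 1),
        ("s4", 1, "s5", 1),
    ],
)
kids, grandkids = children_and_grandchildren(conn, "s1")
conn.close()
```

Indexes on seg_id_1 and seg_id_2, as created by create_indexes(), are what keep both queries fast on large graphs.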
- async close()[source]
Properly shut down the database:
- Signal the SQL writer task to stop.
- Wait for the writer task to finish processing its queue (with timeout).
- Cancel the task if it doesn’t finish in time.
- Create indexes (important to do this after all writes).
- Close the SQLite connection.
- Return type:
None
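The write-serialization idea behind this class (producers enqueue batches, a single background task performs all inserts) can be sketched with asyncio.Queue. This is a simplified illustration, not the GraTools implementation: it uses the blocking sqlite3 module instead of aiosqlite, a None sentinel instead of a shutdown flag, and hypothetical class, method, and orientation column names.

```python
import asyncio
import sqlite3
from typing import List, Optional, Sequence, Tuple

# (seg_id_1, orient_seg_1, seg_id_2, orient_seg_2)
Link = Tuple[str, int, str, int]


class QueuedLinkStore:
    """Serialize all writes through one consumer task."""

    def __init__(self, db_file: str = ":memory:") -> None:
        self.db_file = db_file
        self._conn: Optional[sqlite3.Connection] = None
        self._write_queue: Optional[asyncio.Queue] = None
        self._sql_task: Optional[asyncio.Task] = None

    async def connect(self) -> None:
        if self._conn is not None:  # idempotent, as in the description above
            return
        self._conn = sqlite3.connect(self.db_file)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS links (seg_id_1 TEXT, "
            "orient_seg_1 INTEGER, seg_id_2 TEXT, orient_seg_2 INTEGER)"
        )
        self._write_queue = asyncio.Queue(maxsize=100)
        self._sql_task = asyncio.create_task(self._writer())

    async def _writer(self) -> None:
        while True:
            batch = await self._write_queue.get()
            try:
                if batch is None:  # sentinel: stop consuming
                    return
                self._conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?)", batch)
                self._conn.commit()
            finally:
                self._write_queue.task_done()

    async def batch_insert_links(self, links: Sequence[Link]) -> None:
        await self._write_queue.put(list(links))

    async def flush(self) -> None:
        await self._write_queue.join()  # wait until queued batches are written

    async def query_links_by_segment(self, segment_id: str) -> List[Link]:
        cursor = self._conn.execute(
            "SELECT * FROM links WHERE seg_id_1 = ? OR seg_id_2 = ? ORDER BY rowid",
            (segment_id, segment_id),
        )
        return cursor.fetchall()

    async def close(self) -> None:
        await self._write_queue.put(None)
        await self._sql_task
        # Indexes are created after all writes, mirroring the documented order.
        self._conn.execute("CREATE INDEX IF NOT EXISTS idx_seg_1 ON links (seg_id_1)")
        self._conn.execute("CREATE INDEX IF NOT EXISTS idx_seg_2 ON links (seg_id_2)")
        self._conn.commit()
        self._conn.close()
        self._conn = None


async def demo() -> List[Link]:
    store = QueuedLinkStore()
    await store.connect()
    await store.batch_insert_links([("s1", 1, "s2", 1), ("s2", 1, "s3", -1)])
    await store.batch_insert_links([("s3", -1, "s4", 1)])
    await store.flush()
    rows = await store.query_links_by_segment("s2")
    await store.close()
    return rows


rows_for_s2 = asyncio.run(demo())
```

Because only the writer task touches the connection for inserts, concurrent producers never compete for the SQLite write lock; the bounded queue applies back-pressure when the writer falls behind.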
- class gratools.SubGraph(bam_path, bed_path, logger=<factory>, sample_name=None, sample_name_query=None, chromosome_query=None, start_query=None, stop_query=None, offset_first=0, offset_last=0, add_start_bases_first_segment=0, intersect_bed=None, segment_id_set=<factory>, segment_id_first_query=None, segment_id_first_strand=None, segment_id_last_query=None, segment_id_last_strand=None, segment_id_first=None, segment_id_last=None, works_path=None, merge=None, build_fasta_flag=False, gfa_walk_list=<factory>, gfa_link_list=<factory>, gfa_segment_list=<factory>, dict_segments_samples=<factory>, dict_segments_sequence=<factory>, sequences_list=<factory>, progress_dict=None, task_id=None, regions=None, intersected_results_by_regions=None)[source]
Bases:
object
A class representing a subgraph in a genomic analysis pipeline. It handles operations like BED file intersection, GFA walk/link/segment building, and FASTA sequence generation for a specific sample or region.
Attributes
- bam_pathPath
The file path to the BAM file.
- bed_pathPath
The file path to the BED file.
- loggerlogging.Logger
The logger instance for logging messages. Defaults to a logger named “GraTools”.
- sample_nameOptional[str]
The name of the sample, derived from bed_path. Defaults to None.
- sample_name_queryOptional[str]
The name of the sample being queried. Defaults to None.
- chromosome_queryOptional[str]
The chromosome name for the query. Defaults to None.
- start_queryOptional[int]
The start position on the chromosome for the query. Defaults to None.
- stop_queryOptional[int]
The stop position on the chromosome for the query. Defaults to None.
- offset_firstint
The offset for the first segment in the query region. Defaults to 0.
- offset_lastint
The offset for the last segment in the query region. Defaults to 0.
- add_start_bases_first_segmentint
Additional start bases for the first segment. Defaults to 0.
- intersect_bedOptional[BedTool]
The BedTool object for intersected BED regions. Defaults to None.
- segment_id_setSet[str]
A set storing unique segment IDs encountered during walk building.
- segment_id_first_queryOptional[str]
The ID of the first segment in the query region. Defaults to None.
- segment_id_first_strandOptional[str]
The strand (‘+’ or ‘-’) of the first segment ID in the query. Defaults to None.
- segment_id_last_queryOptional[str]
The ID of the last segment in the query region. Defaults to None.
- segment_id_last_strandOptional[str]
The strand (‘+’ or ‘-’) of the last segment ID in the query. Defaults to None.
- segment_id_firstOptional[str]
The first segment ID encountered for the current sample (might be the same as the query). Defaults to None.
- segment_id_lastOptional[str]
The last segment ID encountered for the current sample (might be the same as the query). Defaults to None.
- works_pathOptional[Path]
The working directory path. Defaults to None.
- mergeOptional[int]
Merge distance parameter for BED operations (e.g., bedtools merge -d). Defaults to None.
- build_fasta_flagOptional[bool]
Flag indicating whether FASTA sequences should be built. Defaults to False.
- gfa_walk_listList[str]
A list of GFA walk strings (W lines). Defaults to an empty list.
- gfa_link_listList[str]
A list of GFA link strings (L lines). Defaults to an empty list.
- gfa_segment_listList[str]
A list of GFA segment strings (S lines). Defaults to an empty list.
- dict_segments_samplesdefaultdict[str, List[str]]
A dictionary mapping segment IDs to a list of sample identifiers. Defaults to an empty defaultdict.
- dict_segments_sequencedefaultdict[str, str]
A dictionary mapping segment IDs to their sequences. Defaults to an empty defaultdict.
- sequences_listList[SeqRecord]
A list of Biopython SeqRecord objects for generated FASTA sequences. Defaults to an empty list.
- progress_dictOptional[Dict]
A dictionary to track progress, typically for multi-processing. Defaults to None.
- task_idOptional[TaskID]
The task ID for progress tracking with rich.progress. Defaults to None.
- regionsOptional[List[Dict[str, Any]]]
A list of regions (dictionaries with ‘chromosome’, ‘start’, ‘stop’). Defaults to None.
- intersected_results_by_regionsOptional[BedTool]
A BedTool object containing combined intersected results for all regions. Defaults to None.
- bam_path: Path = <dataclasses._MISSING_TYPE object>
- bed_path: Path = <dataclasses._MISSING_TYPE object>
- logger: Logger = <dataclasses._MISSING_TYPE object>
- offset_first: int = 0
- offset_last: int = 0
- add_start_bases_first_segment: int = 0
- intersect_bed: BedTool | None = None
- dict_segments_samples: defaultdict = <dataclasses._MISSING_TYPE object>
- dict_segments_sequence: defaultdict = <dataclasses._MISSING_TYPE object>
- sequences_list: List[SeqRecord] = <dataclasses._MISSING_TYPE object>
- task_id: TaskID | None = None
- intersected_results_by_regions: BedTool | None = None
- compute_intersection()[source]
Compute the intersection of BED regions for each region in self.regions. Store the results in self.intersected_results_by_regions.
- Return type:
None
- build_walks()[source]
Build GFA walks (W lines) and links (L lines) from the intersected BED regions.
- Return type:
None
- build_segments()[source]
Build GFA segments (S lines) from the BAM file for segments in self.segment_id_set.
- Return type:
None
- filter_bed_with_awk()[source]
Filter a BED file using an awk command line to extract only lines containing an ID of interest.
- Return type:
BedTool
- get_chr_pos(progress_dict, task_id)[source]
Identify chromosomal regions corresponding to segments in self.segment_id_set, then compute intersections, and finally build walks and segments for these regions.
Parameters
- progress_dictOptional[Dict]
A dictionary to track progress for multiprocessing.
- task_idOptional[TaskID]
The task ID for rich.progress tracking.
- Parameters:
progress_dict (Dict | None)
task_id (TaskID | None)
- Return type:
None
- build_fasta()[source]
Build FASTA sequences from the GFA walks (self.gfa_walk_list).
This method recovers FASTA sequences by processing each walk, extracting the constitutive segments, and applying specific offsets if the current sample is the query sample (self.sample_name == self.sample_name_query). The offsets trim segment sequences at the beginning or end of the walk to match the precise query region boundaries (self.start_query, self.stop_query). The resulting sequences are stored as SeqRecord objects in self.sequences_list.
- Return type:
None
- Parameters:
bam_path (Path)
bed_path (Path)
logger (Logger)
sample_name (str | None)
sample_name_query (str | None)
chromosome_query (str | None)
start_query (int | None)
stop_query (int | None)
offset_first (int)
offset_last (int)
add_start_bases_first_segment (int)
intersect_bed (BedTool | None)
segment_id_first_query (str | None)
segment_id_first_strand (str | None)
segment_id_last_query (str | None)
segment_id_last_strand (str | None)
segment_id_first (str | None)
segment_id_last (str | None)
works_path (Path | None)
merge (int | None)
build_fasta_flag (bool | None)
dict_segments_samples (defaultdict)
dict_segments_sequence (defaultdict)
sequences_list (List[SeqRecord])
progress_dict (Dict | None)
task_id (TaskID | None)
intersected_results_by_regions (BedTool | None)
- class gratools.GFA(gfa_path, threads=1, logger=<factory>, disable_progress_flag=False, gfa_name=None, version=None, header_gfa=<factory>, sample_reference=None, bam_segments_file=None, dict_samples_chrom=<factory>, dict_segments_size=<factory>, dict_segments_samples=<factory>, dict_samples_bed=<factory>, works_path=None, bed_path=None, bam_path=None, found_minigraph=False, index_links=False, db_links=None, segment_count=0, total_segment_length=0, link_count=0, degrees=<factory>, walks_count=0, max_walk_rank=0, sum_rank0_length=0, input_genome_size=0, walks_info=<factory>, inverted_links_count=0, negative_links_count=0, self_links_count=0, isolated_segments=<factory>, shared_executor=None, progress=None, line_type_counts=<factory>, header_gfa_file=None, stats_file=None, db_file_path=None)[source]
Bases:
object
Manages parsing of a GFA (Graphical Fragment Assembly) file, computes statistics, and generates related files such as the BAM file containing segments and per-sample BED files for paths. Also fills an asynchronous database for links and feeds an asynchronous BED writer.
Attributes
- gfa_pathPath
Path to the input GFA file (can be .gfa or .gfa.gz).
- threadsint, optional
Number of threads for operations like BAM file processing. Default is 1.
- loggerlogging.Logger
Logger object. Default is a logger named “GraTools”.
- gfa_nameOptional[str]
Name of the GFA file derived from gfa_path (without extensions). Auto-initialized.
- versionOptional[str]
GFA version extracted from the header (e.g., “1.0”). Auto-initialized.
- header_gfaList[str]
List of header lines (H lines) from the GFA file. Auto-initialized.
- sample_referenceOptional[str]
Reference sample name, potentially from GFA header (RS tag). Auto-initialized.
- bam_segments_fileOptional[Path]
Path to the BAM file where segments (S lines) will be written. Auto-initialized.
- dict_samples_chromdefaultdict[str, OrderedDict[str, List[str]]]
Maps sample names to an OrderedDict of chromosome names, which in turn map to a list of tab-separated “start\tstop” fragment strings derived from Walk (W) lines. Auto-initialized.
- dict_segments_sizedefaultdict[str, int]
Maps segment IDs to their length (in base pairs). Auto-initialized.
- dict_segments_samplesdefaultdict[str, List[str]]
Map segment IDs to a list of sample identifiers (“sample;chromosome;haplotype”) that contain the segment. Auto-initialized.
- dict_samples_beddefaultdict[str, OrderedDict[str, Path]]
(Note: This attribute is not directly populated by the current parsing logic. It was likely intended to track the paths of generated BED files. The AsyncBedWriter internally manages these paths. This attribute might be redundant or used for post-processing tracking). Auto-initialized.
- works_pathOptional[Path]
Path to the working directory (e.g., “…/{gfa_name}_GraTools_INDEX”). Auto-initialized.
- bed_pathOptional[Path]
Path to the subdirectory for BED files within works_path. Auto-initialized.
- bam_pathOptional[Path]
Path to the subdirectory for BAM files within works_path. Auto-initialized.
- found_minigraphbool
Flag indicating if a sample named ‘MINIGRAPH’ (case-insensitive) was found in Walk lines. Defaults to False.
- index_linksbool, optional
If True, GFA links (L lines) are stored in an SQLite database. Defaults to False.
- db_linksOptional[AsyncGfaDatabase]
Asynchronous database handler for GFA links. Auto-initialized if index_links is True.
- segment_countint
Total number of segments (S lines) processed. Defaults to 0.
- total_segment_lengthint
Sum of lengths of all segments. Defaults to 0.
- link_countint
Total number of links (L lines) processed. Defaults to 0.
- degreesdefaultdict[str, int]
Maps segment IDs to their degree (number of links connected). Defaults to an empty defaultdict.
- walks_countint
Total number of walks (W lines) processed. Defaults to 0.
- max_walk_rankint
Maximum number of segments in any single walk. Defaults to 0.
- sum_rank0_lengthint
Sum of lengths of the first segments of all walks. Defaults to 0.
- input_genome_sizeint
Cumulative size of all paths (sum of segment lengths along each walk). Defaults to 0.
- walks_infoList[Dict[str, Any]]
List of dictionaries, each containing info for a walk (Path name, Sequence length, Num Segments). Defaults to an empty list.
- inverted_links_countint
Count of links where orientations differ (e.g., S1+ -> S2-). Defaults to 0.
- negative_links_countint
Count of links where both segments have negative orientation (S1- -> S2-). Defaults to 0.
- self_links_countint
Count of links where a segment links to itself (S1 -> S1). Defaults to 0.
- isolated_segmentsSet[str]
Set of segment IDs that have no links connected to them. Initialized with all segments, then linked ones removed. Defaults to an empty set.
- shared_executorOptional[ThreadPoolExecutor]
Executor for running synchronous tasks in threads. Auto-initialized.
- progressOptional[Progress]
Rich Progress instance for displaying progress. Auto-initialized.
- line_type_countsCounter
Counts of each GFA line type (H, S, L, W, P, C, E, U). Auto-initialized.
- header_gfa_fileOptional[Path]
Path to where the GFA header is saved. Auto-initialized.
- stats_fileOptional[Path]
Path to where GFA statistics are saved. Auto-initialized.
- db_file_pathOptional[Path]
Path to the SQLite database file for links. Auto-initialized.
- RE_ORIENTED_SEG_GT_LTre.Pattern
Compiled regular expression for parsing oriented segments using ‘>’ and ‘<’. Auto-initialized.
- RE_ORIENTED_SEG_PLUS_MINUSre.Pattern
Compiled regular expression for parsing oriented segments using ‘+’ and ‘-’. Auto-initialized.
- disable_progress_flag: bool
If True, progress bars are disabled. Defaults to False.
- gfa_path: Path = <dataclasses._MISSING_TYPE object>
- threads: int = 1
- logger: Logger = <dataclasses._MISSING_TYPE object>
- dict_samples_chrom: defaultdict = <dataclasses._MISSING_TYPE object>
- dict_segments_size: defaultdict = <dataclasses._MISSING_TYPE object>
- dict_segments_samples: defaultdict = <dataclasses._MISSING_TYPE object>
- dict_samples_bed: defaultdict = <dataclasses._MISSING_TYPE object>
- found_minigraph: bool = False
- index_links: bool = False
- db_links: AsyncGfaDatabase | None = None
- segment_count: int = 0
- total_segment_length: int = 0
- link_count: int = 0
- degrees: defaultdict = <dataclasses._MISSING_TYPE object>
- walks_count: int = 0
- max_walk_rank: int = 0
- sum_rank0_length: int = 0
- input_genome_size: int = 0
- inverted_links_count: int = 0
- negative_links_count: int = 0
- self_links_count: int = 0
- isolated_segments: set = <dataclasses._MISSING_TYPE object>
- shared_executor: ThreadPoolExecutor | None = None
- progress: Progress | None = None
- line_type_counts: Counter = <dataclasses._MISSING_TYPE object>
- RE_ORIENTED_SEG_GT_LT: Pattern = <dataclasses._MISSING_TYPE object>
- RE_ORIENTED_SEG_PLUS_MINUS: Pattern = <dataclasses._MISSING_TYPE object>
- save_header()[source]
Save the GFA header lines (H lines) to a text file.
- Return type:
None
- sort_file_in_place()[source]
Sort a file in place using Unix commands without creating a temporary copy. The file must be in the format: sample chromosome start end
- async parse_gfa()[source]
Parses the GFA file line by line, processing headers, segments, links, and walks. Segments are written to a BAM file. Links are (optionally) stored in an async SQLite DB. Walk information is used to generate data for BED files, written by AsyncBedWriter.
- Return type:
None
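The link-related counters listed in the attributes above (inverted_links_count, negative_links_count, self_links_count, degrees, isolated_segments, line_type_counts) can be illustrated with a small single-pass parser over S and L lines. This is a sketch, not the GraTools parser: the function name is hypothetical, and counting a self link twice in the degree of its segment is an assumption.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def link_statistics(
    gfa_lines: Iterable[str],
) -> Tuple[Counter, Dict[str, int], Dict[str, int], Set[str]]:
    """Return (line type counts, degrees, link category counts, isolated segments)."""
    line_type_counts: Counter = Counter()
    degrees: Dict[str, int] = defaultdict(int)
    all_segments: Set[str] = set()
    counts = {"inverted": 0, "negative": 0, "self": 0}

    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0]
        if not record_type:
            continue
        line_type_counts[record_type] += 1
        if record_type == "S":
            all_segments.add(fields[1])
        elif record_type == "L":
            seg_1, orient_1, seg_2, orient_2 = fields[1:5]
            degrees[seg_1] += 1
            degrees[seg_2] += 1
            if orient_1 != orient_2:
                counts["inverted"] += 1  # e.g. S1+ -> S2-
            elif orient_1 == "-":
                counts["negative"] += 1  # S1- -> S2-
            if seg_1 == seg_2:
                counts["self"] += 1  # S1 -> S1

    # Segments that never appear in a link are isolated.
    isolated = all_segments - set(degrees)
    return line_type_counts, dict(degrees), counts, isolated


# Illustrative GFA fragment.
demo_lines = [
    "S\ts1\tACGT",
    "S\ts2\tGG",
    "S\ts3\tTTA",
    "S\ts4\tC",
    "L\ts1\t+\ts2\t+\t0M",
    "L\ts2\t+\ts3\t-\t0M",
    "L\ts3\t-\ts2\t-\t0M",
    "L\ts1\t+\ts1\t+\t0M",
]
type_counts, degree_map, link_counts, isolated_set = link_statistics(demo_lines)
```

Computing the isolated set at the end, rather than removing segments as links arrive, keeps the result independent of whether S lines precede L lines in the file.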
- run()[source]
Synchronous entry point to orchestrate GFA parsing and BED file sorting. Sets up an asyncio event loop and runs the asynchronous parse_gfa method. Then, sorts the generated BED files using a ThreadPoolExecutor.
- tag_bam()[source]
Tags the generated BAM segments file with sample walk information (SW tag). This uses the GratoolsBam class to perform the tagging. The original BAM segments file is overwritten with the tagged version.
- Return type:
None
- Parameters:
gfa_path (Path)
threads (int)
logger (Logger)
disable_progress_flag (bool | None)
gfa_name (str | None)
version (str | None)
sample_reference (str | None)
bam_segments_file (Path | None)
dict_samples_chrom (defaultdict)
dict_segments_size (defaultdict)
dict_segments_samples (defaultdict)
dict_samples_bed (defaultdict)
works_path (Path | None)
bed_path (Path | None)
bam_path (Path | None)
found_minigraph (bool)
index_links (bool)
db_links (AsyncGfaDatabase | None)
segment_count (int)
total_segment_length (int)
link_count (int)
degrees (defaultdict)
walks_count (int)
max_walk_rank (int)
sum_rank0_length (int)
input_genome_size (int)
inverted_links_count (int)
negative_links_count (int)
self_links_count (int)
isolated_segments (set)
shared_executor (ThreadPoolExecutor | None)
progress (Progress | None)
line_type_counts (Counter)
header_gfa_file (Path | None)
stats_file (Path | None)
db_file_path (Path | None)
- class gratools.GratoolsBam(bam_path, threads=1, logger=<factory>, suffix=None, works_path=None, gfa_name=None, tagging=False, disable_progress_flag=False)[source]
Bases:
object
Handles operations related to BAM files in the GraTools context, such as indexing, extracting segment information, tagging segments, and performing various analyses (core/dispensable ratio, depth statistics, etc.).
Attributes
- bam_pathPath
Path to the BAM file.
- threadsint, optional
Number of threads for BAM operations (e.g., reading, indexing). Defaults to 1.
- loggerlogging.Logger
Logger instance. Defaults to a logger named “GraTools”.
- suffixOptional[str], optional
Suffix to append to output filenames generated by analyses. Defaults to None.
- works_pathOptional[Path], optional
Working directory path for saving output files. Defaults to None (uses BAM parent dir).
- gfa_nameOptional[str], optional
Name of the associated GFA file (used for naming output files). Defaults to None.
- taggingbool, optional
If True, indicates that operations might modify tags, potentially requiring re-indexing. Used by index_bam to decide if indexing is needed. Defaults to False.
- progressOptional[Progress]
Rich Progress instance for displaying progress. Auto-initialized.
- disable_progress_flag Optional[bool], optional
Flag to disable progress bar. Defaults to False.
- bam_path: Path = <dataclasses._MISSING_TYPE object>
- threads: int = 1
- logger: Logger = <dataclasses._MISSING_TYPE object>
- tagging: bool = False
- disable_progress_flag: bool = False
- progress: Progress | None = <dataclasses._MISSING_TYPE object>
- index_bam()[source]
Indexes the BAM file using pysam.index if the index is missing or outdated. The index file will have a ‘.bai’ or ‘.crai’ extension depending on BAM format.
- Return type:
None
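The "missing or outdated" check can be pictured as a comparison of file modification times. The following is a minimal sketch under stated assumptions, not GraTools code: the helper name `index_is_stale` is hypothetical, the mtime rule is an assumption about what "outdated" means, and only the '.bai' suffix is considered.

```python
import os
import tempfile
from pathlib import Path


def index_is_stale(bam_path: Path) -> bool:
    """Return True when the .bai index is missing or older than the BAM file.

    Hypothetical helper: the mtime rule is an assumption, not GraTools code.
    """
    index_path = bam_path.with_name(bam_path.name + ".bai")
    if not index_path.exists():
        return True
    return index_path.stat().st_mtime < bam_path.stat().st_mtime


# Demonstration on empty placeholder files (no real BAM content is needed).
with tempfile.TemporaryDirectory() as tmp:
    bam = Path(tmp) / "graph.bam"
    bam.touch()
    missing = index_is_stale(bam)      # no index file yet
    bai = Path(tmp) / "graph.bam.bai"
    bai.touch()
    os.utime(bam, (1_000, 1_000))      # BAM older than its index
    os.utime(bai, (2_000, 2_000))
    fresh = index_is_stale(bam)
    os.utime(bam, (3_000, 3_000))      # BAM modified after indexing
    outdated = index_is_stale(bam)
```

When such a check returns True, the real method would then call pysam.index on the BAM path.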
- build_segments(list_segments=None)[source]
Extracts specified segments from a BAM file and reconstructs their GFA S-line representation. Also populates dictionaries for segment samples and sequences.
Parameters
- list_segmentsOptional[List[str]], optional
A list of segment IDs (query_name in BAM) to extract. The behavior for None or an empty list is not fixed by this reference: the method may process all segments or return empty results, and the current pysam.view call implies that a list is expected.
Returns
- Tuple[List[str], defaultdict[str, List[str]], defaultdict[str, str]]
gfa_s_lines_list: List of strings, each a GFA S-line.
dict_seg_samples: defaultdict mapping segment ID to list of “sample;chrom;haplo” strings from SW tag.
dict_seg_sequence: defaultdict mapping segment ID to its sequence.
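The three return values can be illustrated without a BAM file. This is a minimal sketch, assuming each read has already been reduced to a `(segment_id, sequence, sw_tag)` tuple; that input shape and the function name `records_to_gfa` are hypothetical. The S-line holds only the required GFA fields (record type, name, sequence), and any optional tags GraTools may append are not shown.

```python
from collections import defaultdict


def records_to_gfa(records):
    """Build S-lines plus the two lookup dictionaries from plain tuples.

    `records` holds (segment_id, sequence, sw_tag) tuples standing in for
    BAM reads; this input shape is an assumption made for the sketch.
    """
    gfa_s_lines = []
    seg_samples = defaultdict(list)
    seg_sequence = defaultdict(str)
    for seg_id, sequence, sw_tag in records:
        # Required GFA S-line fields: record type, segment name, sequence.
        gfa_s_lines.append(f"S\t{seg_id}\t{sequence}")
        seg_sequence[seg_id] = sequence
        # SW is a comma-separated list of "sample;chrom;haplo" entries.
        seg_samples[seg_id].extend(e for e in sw_tag.split(",") if e)
    return gfa_s_lines, seg_samples, seg_sequence


lines, samples, sequences = records_to_gfa(
    [("s1", "ACGT", "A;chr1;0,B;chr1;1"), ("s2", "GG", "A;chr1;0")]
)
```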
- tag(dict_segments_samples, nb_segments)[source]
Adds or updates the ‘SW’ (Sample Walks) tag to segments in the BAM file. This version uses an integer-to-string mapping for walk paths to improve performance.
The SW tag stores a comma-separated list of “sample;chromosome;haplotype” strings indicating which walks/paths contain the segment. The original BAM file is overwritten with the tagged version.
Parameters
- dict_segments_samplesDict[str, List[int]]
Dictionary mapping segment IDs (query_name) to a list of integer IDs representing the walk paths.
- nb_segmentsint
The total number of segments in the BAM file.
Returns
- Path
The path to the (now tagged and re-indexed) BAM file.
Raises
- FileNotFoundError
If the input BAM file does not exist.
- Exception
If errors occur during BAM reading, writing, or renaming.
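The integer-to-string mapping can be pictured as follows. This is a sketch only: the lookup table name `id_to_walk` and the helper `build_sw_value` are hypothetical, and the actual BAM rewriting is left out. The point is that each "sample;chromosome;haplotype" string is stored once and segments carry small integers until the tag value is written.

```python
def build_sw_value(walk_ids, id_to_walk):
    """Join the walk strings for one segment into an SW tag value."""
    return ",".join(id_to_walk[i] for i in walk_ids)


# Hypothetical lookup table: one entry per walk, shared by all segments.
id_to_walk = {0: "sampleA;chr1;0", 1: "sampleB;chr1;1"}

# Same shape as the dict_segments_samples parameter: segment ID -> walk IDs.
dict_segments_samples = {"s1": [0, 1], "s2": [1]}

sw_values = {
    seg_id: build_sw_value(walk_ids, id_to_walk)
    for seg_id, walk_ids in dict_segments_samples.items()
}
```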
- core_dispensable_ratio(nb_samples_gfa, input_as_number, shared_min_cutoff=1, specific_max_cutoff=None, filter_min_len=1)[source]
Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in. Saves results to a CSV file.
Parameters
- nb_samples_gfaint
Total number of unique samples present in the GFA (used for percentage calculation).
- input_as_numberbool
If True, shared_min_cutoff and specific_max_cutoff are treated as absolute counts of samples. If False, they are treated as percentages of nb_samples_gfa.
- shared_min_cutoffint, optional
Minimum number/percentage of samples a segment must be in to be considered “shared” (core). Defaults to 1.
- specific_max_cutoffOptional[int], optional
Maximum number/percentage of samples a segment can be in to be considered “specific” (dispensable). If None, specific analysis might be skipped or use a default (e.g., 1 if input_as_number). Defaults to None.
- filter_min_lenint, optional
Minimum length (bp) for a segment to be included in the filtered analysis. Defaults to 1 (no length filter).
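How the two cutoffs interact with input_as_number can be sketched as below. Several details are assumptions rather than documented behavior: percentages are taken on a 0-100 scale, a minimum cutoff is rounded up and a maximum cutoff is rounded down, and both comparisons are inclusive. The function names are hypothetical.

```python
import math


def to_sample_count(cutoff, nb_samples_gfa, input_as_number, round_up=True):
    """Convert a cutoff to an absolute number of samples.

    Assumptions: percentages use a 0-100 scale; a minimum is rounded up
    and a maximum is rounded down.
    """
    if input_as_number:
        return int(cutoff)
    exact = cutoff * nb_samples_gfa / 100
    return math.ceil(exact) if round_up else math.floor(exact)


def classify(depth, shared_min, specific_max):
    """Label one segment from the number of samples that contain it."""
    labels = []
    if depth >= shared_min:
        labels.append("shared")
    if specific_max is not None and depth <= specific_max:
        labels.append("specific")
    return labels


shared_min = to_sample_count(90, 10, input_as_number=False)
specific_max = to_sample_count(25, 10, input_as_number=False, round_up=False)
```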
- depth_nodes_stat(nb_samples_gfa, filter_min_len=1)[source]
Calculates and displays statistics about segment depth (number of unique samples a segment is found in). Outputs results to console and a CSV file.
Parameters
- nb_samples_gfaint
Total number of unique samples in the GFA. Provided for context; it is not used directly in the calculations of this method.
- filter_min_lenint, optional
Minimum length (bp) for a segment to be included in the filtered depth analysis. Defaults to 1 (no effective length filter).
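Segment depth, as defined above, is the number of unique samples that contain a segment. A minimal sketch of that count, assuming the sample name is the first field of each "sample;chromosome;haplotype" entry from the SW tag (the helper name `segment_depth` and the example data are hypothetical):

```python
from collections import Counter


def segment_depth(sw_entries):
    """Count the unique samples named in a segment's SW entries."""
    return len({entry.split(";")[0] for entry in sw_entries})


segments = {
    "s1": ["A;chr1;0", "A;chr1;1", "B;chr1;0"],  # sample A is counted once
    "s2": ["C;chr2;0"],
    "s3": ["A;chr1;0", "B;chr1;0"],
}

# Histogram: depth -> number of segments observed at that depth.
depth_histogram = Counter(segment_depth(e) for e in segments.values())
```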
- get_specific_and_shared_segments(samples_list_A, samples_list_B=None, filter_min_len=None, output_csv=None)[source]
Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.
Parameters
- samples_list_AList[str]
A list of sample names. A segment is “shared” if it is present in ALL samples in this list.
- samples_list_BOptional[List[str]], optional
An optional second list of sample names. If provided, a segment is “specific” if it is shared by all in samples_list_A AND absent from ALL samples in this list.
- filter_min_lenOptional[int], optional
If set, only segments with a length greater than or equal to this value will be considered.
- output_csvOptional[bool], optional
If True, the function will return sets of the shared and specific segment IDs.
Returns
- Tuple[Set[str], Set[str]]
A set of segment IDs that are shared by all samples in samples_list_A.
A set of segment IDs that are specific to samples_list_A relative to samples_list_B. (This set is a subset of the first one).
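The two definitions reduce to set operations. The sketch below assumes the per-segment sample sets are already available in a dictionary (the name `segment_samples` and the function name are hypothetical) and that the specific set is empty when no second list is given, which this reference does not state.

```python
def shared_and_specific(segment_samples, samples_a, samples_b=None):
    """Return (shared, specific) segment ID sets.

    shared:   segments present in ALL samples of samples_a.
    specific: shared segments absent from ALL samples of samples_b.
    Assumption: without samples_b, the specific set stays empty.
    """
    set_a = set(samples_a)
    set_b = set(samples_b or [])
    shared, specific = set(), set()
    for seg_id, present in segment_samples.items():
        if set_a <= present:
            shared.add(seg_id)
            if set_b and not (set_b & present):
                specific.add(seg_id)
    return shared, specific


segment_samples = {
    "s1": {"A", "B"},        # shared by A and B, absent from C
    "s2": {"A", "B", "C"},   # shared, but also present in C
    "s3": {"A"},             # missing from B, so not shared
}
shared, specific = shared_and_specific(segment_samples, ["A", "B"], ["C"])
```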
- get_segments_and_positions_by_depth(total_gfa_samples, input_as_number, lower_bound_depth, upper_bound_depth, filter_min_len, bed_path=None)[source]
Finds segments within a specific depth range and retrieves their genomic positions from BED files.
This function performs two main steps:
1. Scans the BAM file to identify segments that meet the specified depth and length criteria.
2. For those segments, efficiently queries the relevant BED files to find their exact genomic coordinates (chromosome, start, end).
Parameters
- total_gfa_samplesint
Total number of unique samples in the GFA, used for percentage calculations.
- input_as_numberbool
If True, depth bounds are absolute counts; if False, they are percentages of total_gfa_samples.
- lower_bound_depthint
Minimum sample depth (count or percentage) for a segment to be included.
- upper_bound_depthint
Maximum sample depth (count or percentage) for a segment to be included.
- filter_min_lenint
Minimum length in base pairs for a segment to be considered.
- bed_pathPath, optional
Path to the directory containing the sample-specific BED files.
Returns
- Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]
A tuple containing two dictionaries:
1. segments_with_depth: {segment_id: depth} for all segments matching the criteria.
2. segment_locations: {segment_id: {sample_name: [(chrom, start, end), …]}}
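The selection step and the shape of the returned dictionaries can be sketched as follows. The helper `select_by_depth` is hypothetical, the bounds are assumed to be inclusive, and the coordinates in `segment_locations` are made up for illustration; the BED lookup itself is not shown.

```python
def select_by_depth(depths, lengths, lower, upper, filter_min_len):
    """Keep segments whose depth is in [lower, upper] and that are long enough."""
    return {
        seg: depth
        for seg, depth in depths.items()
        if lower <= depth <= upper and lengths[seg] >= filter_min_len
    }


depths = {"s1": 1, "s2": 3, "s3": 5}
lengths = {"s1": 120, "s2": 40, "s3": 300}
segments_with_depth = select_by_depth(depths, lengths, 2, 5, 50)

# Illustrative shape of the second dictionary (coordinates are made up).
segment_locations = {"s3": {"sampleA": [("chr1", 100, 400)]}}

# Flatten the nested structure into one row per genomic hit.
rows = [
    (seg, sample, chrom, start, end)
    for seg, by_sample in segment_locations.items()
    for sample, hits in by_sample.items()
    for chrom, start, end in hits
]
```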
- export_nodes_to_csv(output_csv_path, core_threshold_percent=0.95)[source]
Exports enhanced node information to a CSV file using an efficient, single-pass approach.
This method aggregates data in memory before creating a final Pandas DataFrame, making it much more memory-efficient than building a list of all records. It correctly parses the ‘SW’ tag to build a list of samples for each unique node.
Parameters
- output_csv_pathPath
The path where the output CSV file will be saved.
- core_threshold_percentfloat, optional
The proportion of total samples, expressed as a fraction between 0 and 1 (0.95 means 95%), above which a node is considered ‘core’. Defaults to 0.95.
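The single-pass aggregation can be sketched with the standard library. This sketch deliberately uses the csv module instead of pandas, and its column names, the "dispensable" label, and the `(node_id, sw_tag)` input shape are assumptions, not the real output format.

```python
import csv
import io


def export_nodes(records, total_samples, core_threshold=0.95):
    """Aggregate samples per node in one pass and return CSV text.

    `records` is an iterable of (node_id, sw_tag) pairs (assumed shape).
    """
    samples_by_node = {}
    for node_id, sw_tag in records:  # single pass over the input
        names = {e.split(";")[0] for e in sw_tag.split(",") if e}
        samples_by_node.setdefault(node_id, set()).update(names)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["node_id", "n_samples", "category"])  # assumed columns
    for node_id, names in samples_by_node.items():
        is_core = len(names) / total_samples > core_threshold
        writer.writerow([node_id, len(names), "core" if is_core else "dispensable"])
    return out.getvalue()


csv_text = export_nodes([("s1", "A;c1;0,B;c1;0"), ("s2", "A;c1;0")], total_samples=2)
```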
- class gratools.ThreadedRichHandler(*args, **kwargs)[source]
Bases: RichHandler
Custom RichHandler that processes log records in a separate thread to prevent blocking the main application thread, especially during I/O-bound logging operations (like writing to console or complex formatting).
It also includes a feature to trigger a program exit if a log record of ERROR level or higher is emitted through it.
Attributes
- runningbool
A flag indicating whether the log processing worker thread should continue running.
- log_queuequeue.Queue[logging.LogRecord]
A thread-safe queue used to buffer log records before they are processed by the worker thread.
- worker_threadthreading.Thread
The background thread responsible for consuming log records from log_queue.
- critical_error_occurredbool
A flag set to True if an ERROR or CRITICAL log has been processed, leading to program termination.
- __init__(*args, **kwargs)[source]
Initializes the ThreadedRichHandler.
Sets up the log queue and starts the background worker thread.
- Return type:
None
- emit(record)[source]
Queues a log record for processing by the worker thread.
If the log record’s level is ERROR or higher, this method sets a flag to indicate a critical error and initiates the process to stop the log processing and exit the application.
Parameters
- recordlogging.LogRecord
The log record to be emitted.
- Return type:
None
- stop_processing()[source]
Signals the worker thread to stop and waits for it to terminate.
This ensures that the queue is flushed of pending log records before the thread exits.
- Return type:
None
- close()[source]
Closes the handler, ensuring the worker thread is stopped and resources are released.
- Return type:
None
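The queue-and-worker pattern described for this class can be sketched with the standard library alone. The sketch is not the GraTools implementation: it subclasses logging.Handler instead of RichHandler, forwards records to a hypothetical `target` handler, and only sets the error flag instead of exiting the program. The class names `QueueWorkerHandler` and `ListHandler` are invented for the example.

```python
import logging
import queue
import threading


class QueueWorkerHandler(logging.Handler):
    """Hands records to a background thread through a queue."""

    _STOP = object()  # sentinel telling the worker to exit

    def __init__(self, target):
        super().__init__()
        self.target = target  # handler that performs the real output
        self.log_queue = queue.Queue()
        self.critical_error_occurred = False
        self.worker_thread = threading.Thread(target=self._drain, daemon=True)
        self.worker_thread.start()

    def emit(self, record):
        if record.levelno >= logging.ERROR:
            self.critical_error_occurred = True  # the sketch only sets the flag
        self.log_queue.put(record)

    def _drain(self):
        while True:
            record = self.log_queue.get()
            if record is self._STOP:
                break
            self.target.handle(record)

    def stop_processing(self):
        # The sentinel is queued last, so pending records are flushed first.
        self.log_queue.put(self._STOP)
        self.worker_thread.join()

    def close(self):
        if self.worker_thread.is_alive():
            self.stop_processing()
        super().close()


class ListHandler(logging.Handler):
    """Collects messages in memory so the demo has something to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


sink = ListHandler()
handler = QueueWorkerHandler(sink)
log = logging.getLogger("threaded_sketch")
log.propagate = False
log.setLevel(logging.INFO)
log.addHandler(handler)
log.info("first")
log.error("second")
handler.close()
log.removeHandler(handler)
```

Using a sentinel object rather than polling a `running` flag is a design choice of this sketch; it guarantees that every record queued before the stop request is handled.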
- gratools.configure_logger(name, log_dir_path, verbosity_level, file_suffix='')[source]
Configures and returns a logger instance with specified settings.
The logger will output to:
1. The console (via RichHandler for formatted, colorful output).
2. A general log file (e.g., ‘name_log.o’).
3. An error log file for WARNING and higher messages (e.g., ‘name_log.e’).
Parameters
- namestr
The name for the logger (e.g., “GraTools”).
- log_dir_pathPath
The directory path where log files will be stored.
- verbosity_levelstr
The logging verbosity level (e.g., “DEBUG”, “INFO”, “ERROR”). This sets the minimum level for messages to be processed by the logger.
- file_suffixstr, optional
An optional suffix to append to the base name of log files. Defaults to an empty string.
Returns
- logging.Logger
The configured logger instance.
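The three outputs can be approximated with standard-library handlers. In this sketch a plain StreamHandler stands in for RichHandler, the function name is hypothetical, and the position of `file_suffix` in the filename (after "_log") is an assumption.

```python
import logging
import tempfile
from pathlib import Path


def configure_logger_sketch(name, log_dir_path, verbosity_level, file_suffix=""):
    """Return a logger writing to the console, a general file and an error file."""
    logger = logging.getLogger(name)
    logger.setLevel(verbosity_level)  # level names such as "DEBUG" are accepted
    logger.handlers.clear()
    # Assumed naming scheme: <name>_log<suffix>.o and <name>_log<suffix>.e
    base = Path(log_dir_path) / f"{name}_log{file_suffix}"
    general = logging.FileHandler(f"{base}.o")
    errors = logging.FileHandler(f"{base}.e")
    errors.setLevel(logging.WARNING)  # WARNING and higher only
    console = logging.StreamHandler()  # stand-in for RichHandler
    for handler in (console, general, errors):
        logger.addHandler(handler)
    return logger


with tempfile.TemporaryDirectory() as tmp:
    logger = configure_logger_sketch("demo", tmp, "INFO")
    logger.info("info message")
    logger.warning("warning message")
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
    general_text = (Path(tmp) / "demo_log.o").read_text()
    error_text = (Path(tmp) / "demo_log.e").read_text()
```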
- gratools.update_logger_file_suffix(logger, new_file_suffix)[source]
Updates file handlers of a logger to use a new file suffix. Existing log content from the old file (if any) is copied to the new file location before switching. The old file is then deleted.
This is useful if, for instance, a specific operation (like a query) should have its logs in a uniquely named file determined mid-execution.
Parameters
- loggerlogging.Logger
The logger instance whose file handlers need updating.
- new_file_suffixstr
The new suffix to be incorporated into the log filenames. Example: if the old file was “main_log.log” and new_file_suffix is “_query123”, the new file becomes “main_log_query123.log”.
Returns
- logging.Logger
The same logger instance, now with updated file handlers.
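The copy-then-switch behavior can be sketched with logging.FileHandler. The helper names below are hypothetical, and the sketch assumes the suffix is inserted between the file stem and its extension, as in the example above.

```python
import logging
import shutil
import tempfile
from pathlib import Path


def with_suffix_in_name(path: Path, new_file_suffix: str) -> Path:
    """main_log.log + _query123 -> main_log_query123.log"""
    return path.with_name(f"{path.stem}{new_file_suffix}{path.suffix}")


def switch_log_files(logger, new_file_suffix):
    """Move every file handler of `logger` to a renamed log file."""
    for handler in list(logger.handlers):
        if not isinstance(handler, logging.FileHandler):
            continue
        old_path = Path(handler.baseFilename)
        new_path = with_suffix_in_name(old_path, new_file_suffix)
        handler.close()                      # flush before copying
        shutil.copyfile(old_path, new_path)  # keep the existing log content
        old_path.unlink()
        replacement = logging.FileHandler(new_path)  # opens in append mode
        replacement.setLevel(handler.level)
        replacement.setFormatter(handler.formatter)
        logger.removeHandler(handler)
        logger.addHandler(replacement)
    return logger


with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "main_log.log"
    logger = logging.getLogger("suffix_sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.FileHandler(old_file))
    logger.info("before switch")
    switch_log_files(logger, "_query123")
    logger.info("after switch")
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
    new_text = (Path(tmp) / "main_log_query123.log").read_text()
    old_exists = old_file.exists()
```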
- class gratools.CustomCommand(*args, **kwargs)[source]
Bases: Command
Custom Command class that applies global context settings (for help formatting) and prepends a specific header to the help message of individual commands.
- __init__(*args, **kwargs)[source]
Initializes the CustomCommand.
Ensures that the predefined CONTEXT_SETTINGS are applied by default to this command.
- invoke(ctx)[source]
Measures and displays the execution time around the actual call.
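Timing around a call can be sketched with time.perf_counter. This is a generic wrapper, not the Click override itself: the names `timed` and `report` are hypothetical, and the message format is invented for the example.

```python
import time
from functools import wraps


def timed(func, report=print):
    """Wrap `func` so its elapsed time is reported after every call."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Reported even when the wrapped call raises.
            elapsed = time.perf_counter() - start
            report(f"{func.__name__} finished in {elapsed:.3f} s")

    return wrapper


def add(a, b):
    return a + b


messages = []
timed_add = timed(add, report=messages.append)
result = timed_add(2, 3)
```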
- get_help(ctx)[source]
Overrides the default help generation to prepend a custom header.
The header is printed directly to the console using shared_console before the standard help text is generated and returned.
- Parameters:
ctx (click.Context) – The current Click context.
- Returns:
str – The formatted help text, including the prepended header.
- class gratools.CustomGroup(*args, **kwargs)[source]
Bases: Group
Custom Group class that applies global context settings, prepends a header to its own help message, and ensures that all subcommands added via its command decorator use CustomCommand by default.
- __init__(*args, **kwargs)[source]
Initializes the CustomGroup.
Ensures that the predefined CONTEXT_SETTINGS are applied by default to this group and its subcommands (if they don’t override).
- command(*args, **kwargs)[source]
Overrides the default command decorator registration.
This ensures that any command registered using this group’s command method will automatically use CustomCommand as its class, thereby inheriting the custom help formatting and header.
- Parameters:
*args – Positional arguments for the command decorator.
**kwargs – Keyword arguments for the command decorator.
- Returns:
Callable – The decorator that registers the command.
- gratools.validate_percentage_or_int(ctx, param, value)[source]
Click callback for validating that an option’s value is either an integer or a float representing a percentage (between 0.0 and 1.0 inclusive).
This function is intended to be used as a callback for a Click option.
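- Parameters:
ctx (click.Context) – The current Click context.
param (click.Parameter) – The parameter being validated.
value – The value supplied for the option.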
- Returns:
int | float – The validated value, converted to int if it’s a whole number, or float if it’s a percentage (0.0-1.0).
- Raises:
click.BadParameter – If the value is not a valid integer or a float between 0.0 and 1.0.
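The validation rule can be sketched as a plain function. This is not the Click callback itself: the name `parse_percentage_or_int` is hypothetical, ValueError stands in for click.BadParameter, and the handling of inputs such as "1.0" (kept as a float here) is an assumption.

```python
def parse_percentage_or_int(value):
    """Return an int, or a float in [0.0, 1.0]; raise ValueError otherwise.

    A real Click callback would raise click.BadParameter instead.
    """
    text = str(value).strip()
    try:
        return int(text)  # whole numbers such as "3"
    except ValueError:
        pass
    try:
        number = float(text)
    except ValueError:
        raise ValueError(f"{value!r} is not a number") from None
    if not 0.0 <= number <= 1.0:  # also rejects NaN and infinities
        raise ValueError(f"{value!r} is not an integer or a float in [0.0, 1.0]")
    return number


count = parse_percentage_or_int("3")
fraction = parse_percentage_or_int("0.25")
```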
—
Use the Table of Contents in the sidebar to jump quickly to a specific function or class within these modules. Each member is indexed and searchable.