GraTools Package

Developer API Reference and Internal Module Documentation

—

💻 Developer Overview

This page contains the auto-generated API documentation for GraTools. It is intended for developers who wish to contribute to the project or use GraTools as a Python library. All modules are documented with their respective members, functions, and inheritance.

—

Core Modules

This module handles the core pangenome graph structures and GFA parsing logic.

🏗️ gratools.Graph

class gratools.Graph.LinkInfo(seg_id_1, orient_seg_1, orient_key_seg_1, seg_id_2, orient_seg_2, orient_key_seg_2)

Bases: tuple

Information about a link between two segments.

Attributes

seg_id_1 (str): Identifier of the first segment.
orient_seg_1 (int): Orientation of the first segment (+1 or -1).
orient_key_seg_1 (int): Orientation key for the first segment (often the same as orient_seg_1).
seg_id_2 (str): Identifier of the second segment.
orient_seg_2 (int): Orientation of the second segment (+1 or -1).
orient_key_seg_2 (int): Orientation key for the second segment (often the same as orient_seg_2).

orient_key_seg_1: Alias for field number 2

orient_key_seg_2: Alias for field number 5

orient_seg_1: Alias for field number 1

orient_seg_2: Alias for field number 4

seg_id_1: Alias for field number 0

seg_id_2: Alias for field number 3

class gratools.Graph.SubGraph(bam_path, bed_path, logger=<factory>, sample_name=None, sample_name_query=None, chromosome_query=None, start_query=None, stop_query=None, offset_first=0, offset_last=0, add_start_bases_first_segment=0, intersect_bed=None, segment_id_set=<factory>, segment_id_first_query=None, segment_id_first_strand=None, segment_id_last_query=None, segment_id_last_strand=None, segment_id_first=None, segment_id_last=None, works_path=None, merge=None, build_fasta_flag=False, gfa_walk_list=<factory>, gfa_link_list=<factory>, gfa_segment_list=<factory>, dict_segments_samples=<factory>, dict_segments_sequence=<factory>, sequences_list=<factory>, progress_dict=None, task_id=None, regions=None, intersected_results_by_regions=None)[source]

Bases: object

A class representing a subgraph in a genomic analysis pipeline. It handles operations like BED file intersection, GFA walk/link/segment building, and FASTA sequence generation for a specific sample or region.

Attributes

bam_path (Path): The file path to the BAM file.
bed_path (Path): The file path to the BED file.
logger (logging.Logger): The logger instance for logging messages. Defaults to a logger named “GraTools”.
sample_name (Optional[str]): The name of the sample, derived from bed_path. Defaults to None.
sample_name_query (Optional[str]): The name of the sample being queried. Defaults to None.
chromosome_query (Optional[str]): The chromosome name for the query. Defaults to None.
start_query (Optional[int]): The start position on the chromosome for the query. Defaults to None.
stop_query (Optional[int]): The stop position on the chromosome for the query. Defaults to None.
offset_first (int): The offset for the first segment in the query region. Defaults to 0.
offset_last (int): The offset for the last segment in the query region. Defaults to 0.
add_start_bases_first_segment (int): Additional start bases for the first segment (usage seems specific, consider clarifying). Defaults to 0.
intersect_bed (Optional[BedTool]): The BedTool object for intersected BED regions. Defaults to None.
segment_id_set (Set[str]): A set storing unique segment IDs encountered during walk building.
segment_id_first_query (Optional[str]): The ID of the first segment in the query region. Defaults to None.
segment_id_first_strand (Optional[str]): The strand (‘+’ or ‘-’) of the first segment in the query. Defaults to None.
segment_id_last_query (Optional[str]): The ID of the last segment in the query region. Defaults to None.
segment_id_last_strand (Optional[str]): The strand (‘+’ or ‘-’) of the last segment in the query. Defaults to None.
segment_id_first (Optional[str]): The first segment ID encountered for the current sample (might be the same as the query). Defaults to None.
segment_id_last (Optional[str]): The last segment ID encountered for the current sample (might be the same as the query). Defaults to None.
works_path (Optional[Path]): The working directory path. Defaults to None.
merge (Optional[int]): Merge distance parameter for BED operations (e.g., bedtools merge -d). Defaults to None.
build_fasta_flag (Optional[bool]): Flag indicating whether FASTA sequences should be built. Defaults to False.
gfa_walk_list (List[str]): A list of GFA walk strings (W lines). Defaults to an empty list.
gfa_link_list (List[str]): A list of GFA link strings (L lines). Defaults to an empty list.
gfa_segment_list (List[str]): A list of GFA segment strings (S lines). Defaults to an empty list.
dict_segments_samples (defaultdict[str, List[str]]): A dictionary mapping segment IDs to a list of sample identifiers. Defaults to an empty defaultdict.
dict_segments_sequence (defaultdict[str, str]): A dictionary mapping segment IDs to their sequences. Defaults to an empty defaultdict.
sequences_list (List[SeqRecord]): A list of Biopython SeqRecord objects for generated FASTA sequences. Defaults to an empty list.
progress_dict (Optional[Dict]): A dictionary to track progress, typically for multi-processing. Defaults to None.
task_id (Optional[TaskID]): The task ID for progress tracking with rich.progress. Defaults to None.
regions (Optional[List[Dict[str, Any]]]): A list of regions (dictionaries with ‘chromosome’, ‘start’, ‘stop’). Defaults to None.
intersected_results_by_regions (Optional[BedTool]): A BedTool object containing combined intersected results for all regions. Defaults to None.

bam_path: Path = <dataclasses._MISSING_TYPE object>

bed_path: Path = <dataclasses._MISSING_TYPE object>

logger: Logger = <dataclasses._MISSING_TYPE object>

sample_name: str | None = None

sample_name_query: str | None = None

chromosome_query: str | None = None

start_query: int | None = None

stop_query: int | None = None

offset_first: int = 0

offset_last: int = 0

add_start_bases_first_segment: int = 0

intersect_bed: BedTool | None = None

segment_id_set: Set[str] = <dataclasses._MISSING_TYPE object>

segment_id_first_query: str | None = None

segment_id_first_strand: str | None = None

segment_id_last_query: str | None = None

segment_id_last_strand: str | None = None

segment_id_first: str | None = None

segment_id_last: str | None = None

works_path: Path | None = None

merge: int | None = None

build_fasta_flag: bool | None = False

gfa_walk_list: List[str] = <dataclasses._MISSING_TYPE object>

gfa_link_list: List[str] = <dataclasses._MISSING_TYPE object>

gfa_segment_list: List[str] = <dataclasses._MISSING_TYPE object>

dict_segments_samples: defaultdict = <dataclasses._MISSING_TYPE object>

dict_segments_sequence: defaultdict = <dataclasses._MISSING_TYPE object>

sequences_list: List[SeqRecord] = <dataclasses._MISSING_TYPE object>

progress_dict: Dict | None = None

task_id: TaskID | None = None

regions: List[Dict[str, Any]] | None = None

intersected_results_by_regions: BedTool | None = None

compute_intersection()[source]

Compute the intersection of BED regions for each region in self.regions. Store the results in self.intersected_results_by_regions.

Return type: None

build_walks()[source]

Build GFA walks (W lines) and links (L lines) from the intersected BED regions.

Return type: None

build_segments()[source]

Build GFA segments (S lines) from the BAM file for segments in self.segment_id_set.

Return type: None

filter_bed_with_awk()[source]

Filter a BED file using an awk command line to extract only lines containing an ID of interest.

Return type: BedTool

get_chr_pos(progress_dict, task_id)[source]

Identify chromosomal regions corresponding to segments in self.segment_id_set, then compute intersections, and finally build walks and segments for these regions.

Parameters

progress_dict (Optional[Dict]): A dictionary to track progress for multiprocessing.
task_id (Optional[TaskID]): The task ID for rich.progress tracking.

Parameters:

progress_dict (Dict | None)
task_id (TaskID | None)

Return type:

None

build_fasta()[source]

Build FASTA sequences from the GFA walks (self.gfa_walk_list).

This method recovers FASTA sequences by processing each walk, extracting the constitutive segments, and applying specific offsets if the current sample is the query sample (self.sample_name == self.sample_name_query). The offsets trim segment sequences at the beginning or end of the walk to match the precise query region boundaries (self.start_query, self.stop_query). The resulting sequences are stored as SeqRecord objects in self.sequences_list.

Return type: None
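
The trimming described for build_fasta() can be pictured with the minimal sketch below. It is not the GraTools implementation: the walk notation (`>`/`<`), the reverse-complement handling, and the exact meaning of the two offsets are assumptions made for illustration, and `walk_to_sequence` is a hypothetical helper.

```python
import re

# Hypothetical segment sequences; GraTools reads the real ones from the BAM file.
segments = {"s1": "AAAACC", "s2": "GGT", "s3": "TTTTCA"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def walk_to_sequence(walk, segments, offset_first=0, offset_last=0):
    """Concatenate the segments of a walk such as '>s1<s2>s3' and trim both ends."""
    parts = []
    for orient, seg_id in re.findall(r"([<>])([^<>]+)", walk):
        seq = segments[seg_id]
        if orient == "<":  # reverse strand: reverse-complement the segment
            seq = seq.translate(COMPLEMENT)[::-1]
        parts.append(seq)
    # Assumption: offset_first bases are dropped from the start of the first
    # segment and offset_last bases from the end of the last segment.
    parts[0] = parts[0][offset_first:]
    if offset_last:
        parts[-1] = parts[-1][:-offset_last]
    return "".join(parts)


sequence = walk_to_sequence(">s1<s2>s3", segments, offset_first=2, offset_last=3)
```

In this toy walk the first segment loses its first two bases and the last segment its last three, while the middle segment is kept whole in reverse-complement form.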

Parameters:

bam_path (Path)
bed_path (Path)
logger (Logger)
sample_name (str | None)
sample_name_query (str | None)
chromosome_query (str | None)
start_query (int | None)
stop_query (int | None)
offset_first (int)
offset_last (int)
add_start_bases_first_segment (int)
intersect_bed (BedTool | None)
segment_id_set (Set[str])
segment_id_first_query (str | None)
segment_id_first_strand (str | None)
segment_id_last_query (str | None)
segment_id_last_strand (str | None)
segment_id_first (str | None)
segment_id_last (str | None)
works_path (Path | None)
merge (int | None)
build_fasta_flag (bool | None)
gfa_walk_list (List[str])
gfa_link_list (List[str])
gfa_segment_list (List[str])
dict_segments_samples (defaultdict)
dict_segments_sequence (defaultdict)
sequences_list (List[SeqRecord])
progress_dict (Dict | None)
task_id (TaskID | None)
regions (List[Dict[str, Any]] | None)
intersected_results_by_regions (BedTool | None)

class gratools.Graph.AsyncGfaDatabase(db_file, timeout=30.0)[source]

Bases: object

Manage an asynchronous SQLite database for storing and querying GFA link data. It uses aiosqlite for non-blocking operations within an asyncio event loop and serializes writes via an internal FIFO queue to prevent SQLite lock contention.

Attributes

db_file (Path): Path to the SQLite database file.
timeout (float): Maximum timeout (in seconds) for SQLite lock acquisition.
logger (logging.Logger): Logger instance.
_conn (Optional[aiosqlite.Connection]): Shared SQLite connection (or None if not connected).
_write_queue (asyncio.Queue): Asynchronous queue for batches of links to be inserted. Max size 100.
_sql_task (Optional[asyncio.Task]): Background task consuming the queue and writing to the database.
_shutdown (bool): Flag to signal shutdown to the writer task.
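
The queue-plus-single-writer pattern described above can be sketched as follows. This is a simplified stand-in, not the class itself: it uses the standard-library sqlite3 module (whose calls block) instead of aiosqlite, the column names are assumed from the link tuple layout, and a `None` sentinel replaces the `_shutdown` flag.

```python
import asyncio
import sqlite3


async def writer(conn, queue):
    """Consume batches one at a time so only this task ever writes."""
    while True:
        batch = await queue.get()
        if batch is None:  # sentinel used here as the stop signal
            break
        conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?, ?, ?)", batch)
        conn.commit()


async def main():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (seg_id_1 TEXT, orient_seg_1 INTEGER, "
        "orient_key_seg_1 INTEGER, seg_id_2 TEXT, orient_seg_2 INTEGER, "
        "orient_key_seg_2 INTEGER)"
    )
    queue = asyncio.Queue(maxsize=100)  # bounded, like _write_queue
    task = asyncio.create_task(writer(conn, queue))
    # Producers only enqueue batches; the writer task performs every INSERT.
    await queue.put([("s1", 1, 1, "s2", 1, 1)])
    await queue.put([("s2", 1, 1, "s3", -1, -1), ("s1", 1, 1, "s3", 1, 1)])
    await queue.put(None)
    await task
    count = conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]
    conn.close()
    return count


row_count = asyncio.run(main())
```

Because inserts are funnelled through one consumer, concurrent producers never compete for the SQLite write lock.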

__init__(db_file, timeout=30.0)[source]

Initialize the instance without opening the connection. The schema is created upon the first call to connect().

Parameters:

db_file (Path) – Path to the SQLite file to be used as backend.
timeout (float, optional) – Maximum wait time for SQLite locks (default 30.0s).

async connect()[source]

Connect to the SQLite db (if not already connected), configure PRAGMA settings, create the ‘links’ table schema (if it doesn’t exist), and start the SQL writer task. This method is idempotent: if already connected, it does nothing.

Return type: None

async batch_insert_links(links)[source]

Enqueue a batch of links for non-blocking insertion. Must be called after await connect().

Parameters: links (List[Tuple[str, int, int, str, int, int]]) – List of tuples, each representing a link: (seg_id_1, orient_seg_1, orient_key_seg_1, seg_id_2, orient_seg_2, orient_key_seg_2).
Return type: None

async create_indexes()[source]

Create indexes on seg_id_1 and seg_id_2 to accelerate queries. Should ideally be called after all data insertions are complete.

Return type: None

async query_links_by_segment(segment_id)[source]

Retrieves all links where segment_id appears as seg_id_1 or seg_id_2.

Parameters: segment_id (str) – The ID of the target segment.
Returns: List of tuples, each representing a full link row from the database.
Return type: List[Tuple[Any, …]]

async test_query_links(segment_id)[source]

Retrieve and categorize links related to a given segment:

- “before”: links where segment_id is seg_id_2.
- “after”: links where segment_id is seg_id_1.

Parameters: segment_id (str) – The segment to analyze.
Returns: List of tuples – (connected_segment_id, position_type, orient_seg_1, orient_seg_2).
Return type: List[Tuple[str, str, int, int]]

async find_children_and_grandchildren(node_id)[source]

Find direct successors (children) and second-degree successors (grandchildren) of a segment. A child is seg_id_2 where node_id is seg_id_1. A grandchild is a child of a child.

Parameters: node_id (str) – The starting segment ID.
Returns: A dictionary – {“children”: [IDs], “grandchildren”: [IDs]}.
Return type: Dict[str, List[str]]
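
The child and grandchild definitions above can be illustrated on an in-memory list of links. The real method queries the SQLite ‘links’ table; `children_and_grandchildren` and the link pairs below are hypothetical, and removing duplicate grandchildren is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical (seg_id_1, seg_id_2) pairs standing in for rows of the links table.
links = [("s1", "s2"), ("s1", "s3"), ("s2", "s4"), ("s3", "s4"), ("s4", "s5")]


def children_and_grandchildren(links, node_id):
    """Return direct successors and successors-of-successors of node_id."""
    successors = defaultdict(list)
    for seg_id_1, seg_id_2 in links:
        successors[seg_id_1].append(seg_id_2)

    children = list(successors.get(node_id, []))
    grandchildren = []
    for child in children:
        for grandchild in successors.get(child, []):
            if grandchild not in grandchildren:  # keep each ID once
                grandchildren.append(grandchild)
    return {"children": children, "grandchildren": grandchildren}


family = children_and_grandchildren(links, "s1")
```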

async close()[source]

Properly shut down the database:

- Signal the SQL writer task to stop.
- Wait for the writer task to finish processing its queue (with timeout).
- Cancel the task if it doesn’t finish in time.
- Create indexes (important to do this after all writes).
- Close the SQLite connection.

Return type: None

Parameters:

db_file (Path)
timeout (float)

class gratools.Graph.AsyncBedWriter(bed_dir, batch_size=1, progress=None)[source]

Bases: object

Asynchronous BED file writer using aiofiles. It buffers lines per sample and writes them in batches to separate BED files.

Attributes

bed_dir (Path): Output directory for .bed files.
batch_size (int): Number of lines to buffer per sample before an automatic flush.
progress (Optional[Progress]): Rich Progress instance for displaying progress (if provided).
logger (logging.Logger): Logger instance.
_queue (asyncio.Queue[Tuple[str, List[str]]]): Internal queue for (sample_name, lines_to_write) tuples. Max size 100.
_shutdown (bool): Flag to signal shutdown to the writer loop.
_task (Optional[asyncio.Task]): The background asyncio task running the _writer_loop.

start()[source]

Start the writer loop as a background task if not already running.

Return type: None

async enqueue(sample, lines)[source]

Add a sample and its corresponding lines to the write queue. This method uses put_nowait on the assumption that the queue rarely fills; await self._queue.put() is the alternative if backpressure on the producer is acceptable when the queue is full.

Parameters:

sample (str) – The sample name (used for filename).
lines (List[str]) – A list of strings (lines) to write to the BED file. Each string should include its own newline character if needed.

Return type:

None

async enqueue_single_line(sample, line)[source]

Asynchronously adds a single line to the write queue. This will wait if the queue is full, providing backpressure.

Parameters:

sample (str)
line (str)
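
The difference between the two enqueue styles (put_nowait versus an awaited put) can be seen on a plain asyncio.Queue. The sketch below does not use AsyncBedWriter; the queue size of 1 and the BED lines are made up so that the queue fills immediately.

```python
import asyncio


async def demo():
    queue = asyncio.Queue(maxsize=1)  # tiny bound so the queue is full at once
    queue.put_nowait(("sampleA", ["chr1\t0\t10\ts1\n"]))

    # put_nowait never waits: on a full queue it raises asyncio.QueueFull.
    try:
        queue.put_nowait(("sampleA", ["chr1\t10\t20\ts2\n"]))
        overflowed = False
    except asyncio.QueueFull:
        overflowed = True

    async def consumer():
        await asyncio.sleep(0.01)
        queue.get_nowait()  # frees one slot

    drain = asyncio.create_task(consumer())
    # An awaited put suspends the producer until the consumer frees a slot.
    await queue.put(("sampleA", ["chr1\t10\t20\ts2\n"]))
    await drain
    return overflowed, queue.qsize()


overflowed, pending = asyncio.run(demo())
```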

async shutdown()[source]

Signal the writer loop to shut down and wait for it to complete. Ensure all pending data is flushed.

Return type: None

Parameters:

bed_dir (Path)
batch_size (int)
progress (Progress | None)

class gratools.Graph.GFA(gfa_path, threads=1, logger=<factory>, disable_progress_flag=False, gfa_name=None, version=None, header_gfa=<factory>, sample_reference=None, bam_segments_file=None, dict_samples_chrom=<factory>, dict_segments_size=<factory>, dict_segments_samples=<factory>, dict_samples_bed=<factory>, works_path=None, bed_path=None, bam_path=None, found_minigraph=False, index_links=False, db_links=None, segment_count=0, total_segment_length=0, link_count=0, degrees=<factory>, walks_count=0, max_walk_rank=0, sum_rank0_length=0, input_genome_size=0, walks_info=<factory>, inverted_links_count=0, negative_links_count=0, self_links_count=0, isolated_segments=<factory>, shared_executor=None, progress=None, line_type_counts=<factory>, header_gfa_file=None, stats_file=None, db_file_path=None)[source]

Bases: object

Manage parsing of a GFA (Graphical Fragment Assembly) file, compute statistics, and generate related files, such as a BAM file containing the segments and per-sample BED files describing the paths. Parsing also fills an asynchronous database with the links and feeds an asynchronous BED writer.

Attributes

gfa_path (Path): Path to the input GFA file (can be .gfa or .gfa.gz).
threads (int, optional): Number of threads for operations like BAM file processing. Defaults to 1.
logger (logging.Logger): Logger object. Defaults to a logger named “GraTools”.
gfa_name (Optional[str]): Name of the GFA file derived from gfa_path (without extensions). Auto-initialized.
version (Optional[str]): GFA version extracted from the header (e.g., “1.0”). Auto-initialized.
header_gfa (List[str]): List of header lines (H lines) from the GFA file. Auto-initialized.
sample_reference (Optional[str]): Reference sample name, potentially from GFA header (RS tag). Auto-initialized.
bam_segments_file (Optional[Path]): Path to the BAM file where segments (S lines) will be written. Auto-initialized.
dict_samples_chrom (defaultdict[str, OrderedDict[str, List[str]]]): Maps sample names to an OrderedDict of chromosome names, which in turn map to a list of “start\tstop” fragment strings derived from Walk (W) lines. Auto-initialized.
dict_segments_size (defaultdict[str, int]): Maps segment IDs to their length (in base pairs). Auto-initialized.
dict_segments_samples (defaultdict[str, List[str]]): Maps segment IDs to a list of sample identifiers (“sample;chromosome;haplotype”) that contain the segment. Auto-initialized.
dict_samples_bed (defaultdict[str, OrderedDict[str, Path]]): Not directly populated by the current parsing logic. It was likely intended to track the paths of generated BED files, which AsyncBedWriter manages internally, so this attribute may be redundant or used only for post-processing tracking. Auto-initialized.
works_path (Optional[Path]): Path to the working directory (e.g., “…/{gfa_name}_GraTools_INDEX”). Auto-initialized.
bed_path (Optional[Path]): Path to the subdirectory for BED files within works_path. Auto-initialized.
bam_path (Optional[Path]): Path to the subdirectory for BAM files within works_path. Auto-initialized.
found_minigraph (bool): Flag indicating if a sample named ‘MINIGRAPH’ (case-insensitive) was found in Walk lines. Defaults to False.
index_links (bool, optional): If True, GFA links (L lines) are stored in an SQLite database. Defaults to False.
db_links (Optional[AsyncGfaDatabase]): Asynchronous database handler for GFA links. Auto-initialized if index_links is True.
segment_count (int): Total number of segments (S lines) processed. Defaults to 0.
total_segment_length (int): Sum of lengths of all segments. Defaults to 0.
link_count (int): Total number of links (L lines) processed. Defaults to 0.
degrees (defaultdict[str, int]): Maps segment IDs to their degree (number of links connected). Defaults to an empty defaultdict.
walks_count (int): Total number of walks (W lines) processed. Defaults to 0.
max_walk_rank (int): Maximum number of segments in any single walk. Defaults to 0.
sum_rank0_length (int): Sum of lengths of the first segments of all walks. Defaults to 0.
input_genome_size (int): Cumulative size of all paths (sum of segment lengths along each walk). Defaults to 0.
walks_info (List[Dict[str, Any]]): List of dictionaries, each containing info for a walk (Path name, Sequence length, Num Segments). Defaults to an empty list.
inverted_links_count (int): Count of links where orientations differ (e.g., S1+ -> S2-). Defaults to 0.
negative_links_count (int): Count of links where both segments have negative orientation (S1- -> S2-). Defaults to 0.
self_links_count (int): Count of links where a segment links to itself (S1 -> S1). Defaults to 0.
isolated_segments (Set[str]): Set of segment IDs that have no links connected to them. Initialized with all segments, then linked ones removed. Defaults to an empty set.
shared_executor (Optional[ThreadPoolExecutor]): Executor for running synchronous tasks in threads. Auto-initialized.
progress (Optional[Progress]): Rich Progress instance for displaying progress. Auto-initialized.
line_type_counts (Counter): Counts of each GFA line type (H, S, L, W, P, C, E, U). Auto-initialized.
header_gfa_file (Optional[Path]): Path to where the GFA header is saved. Auto-initialized.
stats_file (Optional[Path]): Path to where GFA statistics are saved. Auto-initialized.
db_file_path (Optional[Path]): Path to the SQLite database file for links. Auto-initialized.
RE_ORIENTED_SEG_GT_LT (re.Pattern): Compiled regular expression for parsing oriented segments using ‘>’ and ‘<’. Auto-initialized.
RE_ORIENTED_SEG_PLUS_MINUS (re.Pattern): Compiled regular expression for parsing oriented segments using ‘+’ and ‘-’. Auto-initialized.
disable_progress_flag (bool): If True, progress bars are disabled. Defaults to False.
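
To make the link counters concrete, the sketch below applies the definitions of inverted_links_count, negative_links_count, self_links_count, degrees, and isolated_segments to a toy graph. The data are invented, and details such as a self link adding 2 to its segment's degree are assumptions rather than confirmed GraTools behavior.

```python
from collections import defaultdict

segments = {"s1", "s2", "s3", "s4"}
# Hypothetical links as (seg_id_1, orient_seg_1, seg_id_2, orient_seg_2).
links = [
    ("s1", 1, "s2", 1),
    ("s2", 1, "s3", -1),   # orientations differ -> inverted
    ("s3", -1, "s1", -1),  # both negative      -> negative
    ("s2", 1, "s2", 1),    # same segment twice -> self link
]

degrees = defaultdict(int)
isolated = set(segments)  # start with every segment, then remove linked ones
inverted = negative = self_links = 0

for seg1, orient1, seg2, orient2 in links:
    degrees[seg1] += 1
    degrees[seg2] += 1  # assumption: a self link counts twice for its segment
    isolated.discard(seg1)
    isolated.discard(seg2)
    if orient1 != orient2:
        inverted += 1
    elif orient1 == -1:
        negative += 1
    if seg1 == seg2:
        self_links += 1
```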

gfa_path: Path = <dataclasses._MISSING_TYPE object>

threads: int = 1

logger: Logger = <dataclasses._MISSING_TYPE object>

disable_progress_flag: bool | None = False

gfa_name: str | None = None

version: str | None = None

header_gfa: List[str] = <dataclasses._MISSING_TYPE object>

sample_reference: str | None = None

bam_segments_file: Path | None = None

processed_sample_names: Set[str] = <dataclasses._MISSING_TYPE object>

samples_chrom_file_writer: TextIO | None = None

samples_chrom_file_path: Path | None = None

dict_samples_chrom: defaultdict = <dataclasses._MISSING_TYPE object>

dict_segments_size: defaultdict = <dataclasses._MISSING_TYPE object>

dict_segments_samples: defaultdict = <dataclasses._MISSING_TYPE object>

dict_samples_bed: defaultdict = <dataclasses._MISSING_TYPE object>

works_path: Path | None = None

bed_path: Path | None = None

bam_path: Path | None = None

found_minigraph: bool = False

index_links: bool = False

db_links: AsyncGfaDatabase | None = None

segment_count: int = 0

total_segment_length: int = 0

link_count: int = 0

degrees: defaultdict = <dataclasses._MISSING_TYPE object>

walks_count: int = 0

max_walk_rank: int = 0

sum_rank0_length: int = 0

input_genome_size: int = 0

walks_info: List[Dict[str, Any]] = <dataclasses._MISSING_TYPE object>

inverted_links_count: int = 0

negative_links_count: int = 0

self_links_count: int = 0

isolated_segments: set = <dataclasses._MISSING_TYPE object>

shared_executor: ThreadPoolExecutor | None = None

progress: Progress | None = None

line_type_counts: Counter = <dataclasses._MISSING_TYPE object>

header_gfa_file: Path | None = None

stats_file: Path | None = None

db_file_path: Path | None = None

RE_ORIENTED_SEG_GT_LT: Pattern = <dataclasses._MISSING_TYPE object>

RE_ORIENTED_SEG_PLUS_MINUS: Pattern = <dataclasses._MISSING_TYPE object>
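
The two orientation notations handled by RE_ORIENTED_SEG_GT_LT and RE_ORIENTED_SEG_PLUS_MINUS can be illustrated as below. The patterns shown are hypothetical; the expressions compiled by GraTools are not given on this page and may differ. The helper names are also invented for the example.

```python
import re

# Hypothetical patterns for the two notations (not the GraTools originals).
RE_GT_LT = re.compile(r"([<>])([^<>]+)")              # e.g. ">s1<s2"
RE_PLUS_MINUS = re.compile(r"([^,]+)([+-])(?:,|$)")   # e.g. "s1+,s2-"


def parse_gt_lt(walk):
    """Return (segment_id, orientation) pairs, with +1 for '>' and -1 for '<'."""
    return [(seg, 1 if sign == ">" else -1) for sign, seg in RE_GT_LT.findall(walk)]


def parse_plus_minus(path):
    """Return (segment_id, orientation) pairs, with +1 for '+' and -1 for '-'."""
    return [(seg, 1 if sign == "+" else -1) for seg, sign in RE_PLUS_MINUS.findall(path)]


walk_steps = parse_gt_lt(">s1<s2>s3")
path_steps = parse_plus_minus("s1+,s2-,s3+")
```

Both notations decode to the same list of oriented segments, which is why a parser can normalize them to +1/-1 early on.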

save_header()[source]

Save the GFA header lines (H lines) to a text file.

Return type: None

sort_file_in_place()[source]

Sort a file in place using Unix commands without creating a temporary copy. The file must be in the format: sample chromosome start end

async parse_gfa()[source]

Parses the GFA file line by line, processing headers, segments, links, and walks. Segments are written to a BAM file. Links are (optionally) stored in an async SQLite DB. Walk information is used to generate data for BED files, written by AsyncBedWriter.

Return type: None

run()[source]

Synchronous entry point to orchestrate GFA parsing and BED file sorting. Sets up an asyncio event loop and runs the asynchronous parse_gfa method. Then, sorts the generated BED files using a ThreadPoolExecutor.
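
The two-phase shape of run() (an asynchronous phase followed by threaded sorting) can be sketched in a few lines. Everything here is a stand-in: `parse` and `run_pipeline` are hypothetical names, and in-memory lists replace the BED files.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def parse(records):
    """Stand-in for the asynchronous parsing step: group values by sample."""
    per_sample = {}
    for sample, start in records:
        per_sample.setdefault(sample, []).append(start)
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return per_sample


def run_pipeline(records, threads=2):
    """Synchronous entry point: async phase first, then sorting in threads."""
    per_sample = asyncio.run(parse(records))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        sorted_lists = pool.map(sorted, per_sample.values())
        return dict(zip(per_sample.keys(), sorted_lists))


result = run_pipeline([("A", 30), ("B", 5), ("A", 10), ("B", 1)])
```

Keeping the entry point synchronous means callers do not need to manage an event loop themselves.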

tag_bam()[source]

Tags the generated BAM segments file with sample walk information (SW tag). This uses the GratoolsBam class to perform the tagging. The original BAM segments file is overwritten with the tagged version.

Return type: None

compute_statistics()[source]

Orchestrates the computation of GFA graph statistics in parallel and saves the results. This method acts as a dispatcher, running heavy calculations in separate threads.

Return type: dict[str, any]

Parameters:

gfa_path (Path)
threads (int)
logger (Logger)
disable_progress_flag (bool | None)
gfa_name (str | None)
version (str | None)
header_gfa (List[str])
sample_reference (str | None)
bam_segments_file (Path | None)
dict_samples_chrom (defaultdict)
dict_segments_size (defaultdict)
dict_segments_samples (defaultdict)
dict_samples_bed (defaultdict)
works_path (Path | None)
bed_path (Path | None)
bam_path (Path | None)
found_minigraph (bool)
index_links (bool)
db_links (AsyncGfaDatabase | None)
segment_count (int)
total_segment_length (int)
link_count (int)
degrees (defaultdict)
walks_count (int)
max_walk_rank (int)
sum_rank0_length (int)
input_genome_size (int)
walks_info (List[Dict[str, Any]])
inverted_links_count (int)
negative_links_count (int)
self_links_count (int)
isolated_segments (set)
shared_executor (ThreadPoolExecutor | None)
progress (Progress | None)
line_type_counts (Counter)
header_gfa_file (Path | None)
stats_file (Path | None)
db_file_path (Path | None)

class gratools.Graph.GratoolsBam(bam_path, threads=1, logger=<factory>, suffix=None, works_path=None, gfa_name=None, tagging=False, disable_progress_flag=False)[source]

Bases: object

Handles operations related to BAM files in the GraTools context, such as indexing, extracting segment information, tagging segments, and performing various analyses (core/dispensable ratio, depth statistics, etc.).

Attributes

bam_path (Path): Path to the BAM file.
threads (int, optional): Number of threads for BAM operations (e.g., reading, indexing). Defaults to 1.
logger (logging.Logger): Logger instance. Defaults to a logger named “GraTools”.
suffix (Optional[str], optional): Suffix to append to output filenames generated by analyses. Defaults to None.
works_path (Optional[Path], optional): Working directory path for saving output files. Defaults to None (uses BAM parent dir).
gfa_name (Optional[str], optional): Name of the associated GFA file (used for naming output files). Defaults to None.
tagging (bool, optional): If True, indicates that operations might modify tags, potentially requiring re-indexing. Used by index_bam to decide if indexing is needed. Defaults to False.
progress (Optional[Progress]): Rich Progress instance for displaying progress. Auto-initialized.
disable_progress_flag (Optional[bool], optional): Flag to disable progress bar. Defaults to False.

bam_path: Path = <dataclasses._MISSING_TYPE object>

threads: int = 1

logger: Logger = <dataclasses._MISSING_TYPE object>

suffix: str | None = None

works_path: Path | None = None

gfa_name: str | None = None

tagging: bool = False

disable_progress_flag: bool = False

progress: Progress | None = <dataclasses._MISSING_TYPE object>

index_bam()[source]

Indexes the BAM file using pysam.index if the index is missing or outdated. The index file has a ‘.bai’ extension for BAM input or a ‘.crai’ extension for CRAM input.

Return type: None

build_segments(list_segments=None)[source]

Extracts specified segments from a BAM file and reconstructs their GFA S-line representation. Also populates dictionaries for segment samples and sequences.

Parameters

list_segments (Optional[List[str]], optional): A list of segment IDs (query_name in BAM) to extract. If None or empty, this method might process all segments or return empty results, depending on intended behavior (current pysam.view call implies it needs a list).

Returns

Tuple[List[str], defaultdict[str, List[str]], defaultdict[str, str]]

gfa_s_lines_list: List of strings, each a GFA S-line.
dict_seg_samples: defaultdict mapping segment ID to list of “sample;chrom;haplo” strings from SW tag.
dict_seg_sequence: defaultdict mapping segment ID to its sequence.

Parameters: list_segments (List[str] | None)
Return type: Tuple[List[str], defaultdict, defaultdict]

tag(dict_segments_samples, nb_segments)[source]

Adds or updates the ‘SW’ (Sample Walks) tag to segments in the BAM file. This version uses an integer-to-string mapping for walk paths to improve performance.

The SW tag stores a comma-separated list of “sample;chromosome;haplotype” strings indicating which walks/paths contain the segment. The original BAM file is overwritten with the tagged version.

Parameters

dict_segments_samples (Dict[str, List[int]]): Dictionary mapping segment IDs (query_name) to a list of integer IDs representing the walk paths.
nb_segments (int): The total number of segments in the BAM file.

Returns

Path: The path to the (now tagged and re-indexed) BAM file.

Raises

FileNotFoundError: If the input BAM file does not exist.
Exception: If errors occur during BAM reading, writing, or renaming.

Parameters:

dict_segments_samples (Dict[str, List[int]])
nb_segments (int)

Return type:

Path
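
The integer-to-string mapping mentioned for tag() can be pictured as follows. The lookup table, the segment IDs, and the helper `sw_tag_value` are hypothetical; only the SW value format (comma-separated “sample;chromosome;haplotype” strings) comes from the description above.

```python
# Hypothetical lookup table: one integer ID per "sample;chromosome;haplotype" walk.
walk_names = {0: "sampleA;chr1;0", 1: "sampleB;chr1;1", 2: "sampleB;chr2;1"}

# Segment ID -> integer walk IDs, mirroring the dict_segments_samples parameter.
dict_segments_samples = {"s1": [0, 1], "s2": [2]}


def sw_tag_value(segment_id):
    """Expand the integer walk IDs of a segment into the SW tag string."""
    walk_ids = dict_segments_samples.get(segment_id, [])
    return ",".join(walk_names[i] for i in walk_ids)


sw_s1 = sw_tag_value("s1")
```

Storing small integers per segment and expanding them only when the tag is written keeps the per-segment lists compact, which is presumably where the performance gain comes from.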

core_dispensable_ratio(nb_samples_gfa, input_as_number, shared_min_cutoff=1, specific_max_cutoff=None, filter_min_len=1)[source]

Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in. Saves results to a CSV file.

Parameters

nb_samples_gfa (int): Total number of unique samples present in the GFA (used for percentage calculation).
input_as_number (bool): If True, shared_min_cutoff and specific_max_cutoff are treated as absolute counts of samples. If False, they are treated as percentages of nb_samples_gfa.
shared_min_cutoff (int, optional): Minimum number/percentage of samples a segment must be in to be considered “shared” (core). Defaults to 1.
specific_max_cutoff (Optional[int], optional): Maximum number/percentage of samples a segment can be in to be considered “specific” (dispensable). If None, specific analysis might be skipped or use a default (e.g., 1 if input_as_number). Defaults to None.
filter_min_len (int, optional): Minimum length (bp) for a segment to be included in the filtered analysis. Defaults to 1 (no length filter).

Parameters:

nb_samples_gfa (int)
input_as_number (bool)
shared_min_cutoff (int)
specific_max_cutoff (int | None)
filter_min_len (int)

Return type:

None
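
How the cutoffs might be interpreted under the two input_as_number modes is sketched below. The rounding rule (round up), the inclusive comparisons, and the helper names `to_count` and `classify` are all assumptions made for illustration; the page does not specify them.

```python
import math


def to_count(cutoff, nb_samples, input_as_number):
    """Convert a cutoff to an absolute number of samples."""
    if input_as_number:
        return cutoff
    return math.ceil(cutoff * nb_samples / 100)  # assumption: round up


def classify(depth, nb_samples, input_as_number, shared_min_cutoff, specific_max_cutoff):
    """Label a segment found in `depth` samples as shared, specific, or other."""
    shared_min = to_count(shared_min_cutoff, nb_samples, input_as_number)
    specific_max = to_count(specific_max_cutoff, nb_samples, input_as_number)
    if depth >= shared_min:
        return "shared"
    if depth <= specific_max:
        return "specific"
    return "other"


# With 10 samples, 90% and 10% translate to 9 and 1 samples respectively.
labels = [classify(d, 10, False, 90, 10) for d in (9, 5, 1)]
```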

depth_nodes_stat(nb_samples_gfa, filter_min_len=1)[source]

Calculates and displays statistics about segment depth (number of unique samples a segment is found in). Outputs results to console and a CSV file.

Parameters

nb_samples_gfa (int): Total number of unique samples in the GFA, used for context if needed (not directly in calcs here).
filter_min_len (int, optional): Minimum length (bp) for a segment to be included in the filtered depth analysis. Defaults to 1 (no effective length filter).

Parameters:

nb_samples_gfa (int)
filter_min_len (int)

Return type:

None
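
Segment depth, as defined above (number of unique samples a segment is found in), can be derived from an SW tag value as in this sketch. The tag strings are invented, and `depth_from_sw` is a hypothetical helper rather than part of GraTools.

```python
from collections import Counter


def depth_from_sw(sw_value):
    """Count unique sample names in a comma-separated SW tag value."""
    samples = {entry.split(";")[0] for entry in sw_value.split(",") if entry}
    return len(samples)


# Hypothetical SW values: "sample;chromosome;haplotype" entries joined by commas.
sw_tags = {
    "s1": "sampleA;chr1;0,sampleB;chr1;0,sampleA;chr1;1",
    "s2": "sampleB;chr2;0",
    "s3": "",
}

depths = {seg: depth_from_sw(sw) for seg, sw in sw_tags.items()}
histogram = Counter(depths.values())  # how many segments occur at each depth
```

Note that s1 appears in two haplotypes of sampleA but still counts that sample once.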

get_specific_and_shared_segments(samples_list_A, samples_list_B=None, filter_min_len=None, output_csv=None)[source]

Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.

Parameters

samples_list_A (List[str]): A list of sample names. A segment is “shared” if it is present in ALL samples in this list.
samples_list_B (Optional[List[str]], optional): An optional second list of sample names. If provided, a segment is “specific” if it is shared by all in samples_list_A AND absent from ALL samples in this list.
filter_min_len (Optional[int], optional): If set, only segments with a length greater than or equal to this value will be considered.
output_csv (Optional[bool], optional): If True, the function will return sets of the shared and specific segment IDs.

Returns

Tuple[Set[str], Set[str]]

A set of segment IDs that are shared by all samples in samples_list_A.
A set of segment IDs that are specific to samples_list_A relative to samples_list_B. (This set is a subset of the first one).

Parameters:

samples_list_A (List[str])
samples_list_B (List[str] | None)
filter_min_len (int | None)
output_csv (bool | None)

Return type:

Tuple[Set[str], Set[str]]
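
The shared/specific definitions reduce to two set operations, sketched below on invented data. `shared_and_specific` is a hypothetical helper; returning an empty specific set when no second list is given is an assumption, since that case is not described here.

```python
# Hypothetical presence map: segment ID -> samples containing the segment.
segment_samples = {
    "s1": {"A", "B", "C"},
    "s2": {"A", "B"},
    "s3": {"A"},
    "s4": {"B", "C"},
}


def shared_and_specific(segment_samples, samples_list_A, samples_list_B=None):
    """Return (shared, specific) segment ID sets for the two sample groups."""
    group_a = set(samples_list_A)
    # Shared: present in ALL samples of group A.
    shared = {seg for seg, present in segment_samples.items() if group_a <= present}
    if samples_list_B is None:
        return shared, set()  # assumption: no second group, nothing is "specific"
    group_b = set(samples_list_B)
    # Specific: shared by group A and absent from ALL samples of group B.
    specific = {seg for seg in shared if not (segment_samples[seg] & group_b)}
    return shared, specific


shared, specific = shared_and_specific(segment_samples, ["A", "B"], ["C"])
```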

get_segments_and_positions_by_depth(total_gfa_samples, input_as_number, lower_bound_depth, upper_bound_depth, filter_min_len, bed_path=None)[source]

Finds segments within a specific depth range and retrieves their genomic positions from BED files.

This function performs two main steps:

1. Scans the BAM file to identify segments that meet the specified depth and length criteria.
2. For those segments, it efficiently queries the relevant BED files to find their exact genomic coordinates (chromosome, start, end).

Parameters

total_gfa_samples (int): Total number of unique samples in the GFA, used for percentage calculations.
input_as_number (bool): If True, depth bounds are absolute counts; if False, they are percentages of total_gfa_samples.
lower_bound_depth (int): Minimum sample depth (count or percentage) for a segment to be included.
upper_bound_depth (int): Maximum sample depth (count or percentage) for a segment to be included.
filter_min_len (int): Minimum length in base pairs for a segment to be considered.
bed_path (Path, optional): Path to the directory containing the sample-specific BED files.

Returns

Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]: A tuple containing two dictionaries:

1. segments_with_depth: {segment_id: depth} for all segments matching the criteria.
2. segment_locations: {segment_id: {sample_name: [(chrom, start, end), …]}}

Parameters:

total_gfa_samples (int)
input_as_number (bool)
lower_bound_depth (int)
upper_bound_depth (int)
filter_min_len (int)
bed_path (Path)

Return type:

Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]
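The role of input_as_number can be illustrated with a small filter. This is a sketch under stated assumptions, not the library's code: the function name `segments_in_depth_range` and the `segment_depths` mapping are hypothetical, percentages are assumed to be on a 0-100 scale, and both bounds are assumed to be inclusive.

```python
from typing import Dict


def segments_in_depth_range(
    segment_depths: Dict[str, int],
    total_samples: int,
    input_as_number: bool,
    lower: int,
    upper: int,
) -> Dict[str, int]:
    """Hypothetical filter; assumes 0-100 percentages and inclusive bounds."""
    if input_as_number:
        # Bounds are absolute sample counts.
        low, high = lower, upper
    else:
        # Bounds are percentages of the total number of samples.
        low = total_samples * lower / 100
        high = total_samples * upper / 100
    return {seg: depth for seg, depth in segment_depths.items() if low <= depth <= high}


# Toy depths: segment ID -> number of samples containing the segment.
depths = {"s1": 10, "s2": 5, "s3": 1}
by_count = segments_in_depth_range(depths, 10, True, 5, 10)
by_percent = segments_in_depth_range(depths, 10, False, 50, 100)
```

With 10 samples in total, the count range 5 to 10 and the percentage range 50 to 100 select the same segments.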

export_nodes_to_csv(output_csv_path, core_threshold_percent=0.95)[source]

Exports enhanced node information to a CSV file using an efficient, single-pass approach.

This method aggregates data in memory before creating a final Pandas DataFrame, making it much more memory-efficient than building a list of all records. It correctly parses the ‘SW’ tag to build a list of samples for each unique node.

Parameters

output_csv_pathPath: The path where the output CSV file will be saved.
core_threshold_percentfloat, optional: The fraction of total samples (0.95 means 95%) above which a node is considered ‘core’. Defaults to 0.95.

Parameters:

output_csv_path (Path)
core_threshold_percent (float)

Return type:

None

Parameters:

bam_path (Path)
threads (int)
logger (Logger)
suffix (str | None)
works_path (Path | None)
gfa_name (str | None)
tagging (bool)
disable_progress_flag (bool)

Main application logic and high-level command orchestrations.

🛠️ gratools.Gratools

gratools.Gratools.flatten(list_of_lists)[source]

Flattens a list of lists into a single list.

Parameters

list_of_listsList[List[Any]]: A list where each element is itself a list.

Returns

List[Any]: A new list containing all items from the sublists.

Parameters:: list_of_lists (List[List[Any]])
Return type:: List[Any]
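For reference, one-level flattening of this kind is commonly written with itertools.chain. The sketch below shows the documented behaviour; the name `flatten_sketch` is hypothetical and the actual GraTools implementation may differ.

```python
from itertools import chain
from typing import Any, List


def flatten_sketch(list_of_lists: List[List[Any]]) -> List[Any]:
    """Return a new list with the items of every sublist, in order."""
    return list(chain.from_iterable(list_of_lists))


flat = flatten_sketch([[1, 2], [], ["a"]])
```

Only one level of nesting is removed; sublists inside sublists are kept as they are.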

class gratools.Gratools.Gratools(gfa_path, threads=1, outdir=None, logger=None, gfa_name=None, bam_segments_file=None, dict_samples_chrom=<factory>, works_path=None, bed_path=None, bam_path=None, samples_chrom_path=None, dict_gfa_graph_object=<factory>, sample_name_query=None, chromosome_query=None, start_query=0, stop_query=None, suffix=None, build_fasta_flag=False, merge=None, meta=<factory>, index_links=False, debug=False, disable_progress_flag=False)[source]

Bases: object

Main class for the GraTools toolkit, orchestrating GFA file processing, subgraph extraction, and various analyses on genomic graph data.

It handles GFA indexing (delegating to the GFA class), manages input parameters, and provides an interface for operations like subgraph extraction, FASTA generation, and statistical analysis of graph components.

Attributes

gfa_pathPath: Path to the input GFA file.
threadsint, optional: Number of threads for parallelizable operations. Defaults to 1.
outdirOptional[Path], optional: Output directory for GraTools results. If None, defaults to a directory named GraTools-output_{gfa_name} in the same directory as gfa_path.
loggerOptional[logging.Logger]: Logger instance. Auto-configured in __post_init__.
gfa_nameOptional[str]: Name of the GFA file, derived from gfa_path without extensions. Auto-initialized.
bam_segments_fileOptional[Path]: Path to the BAM file containing GFA segments, located within the index directory. Auto-initialized.
dict_samples_chromdefaultdict[str, OrderedDict[str, List[Tuple[str, str]]]]: Maps sample names to an OrderedDict of chromosome names, which maps to a list of (start_fragment, stop_fragment) string tuples. Populated from samples_chrom.txt.
works_pathOptional[Path]: Path to the main GraTools output directory for the current run (e.g., outdir/GraTools-output_{gfa_name}). Auto-initialized.
bed_pathOptional[Path]: Path to the BED files subdirectory within the GFA index directory. Auto-initialized.
bam_pathOptional[Path]: Path to the BAM files subdirectory within the GFA index directory. Auto-initialized.
samples_chrom_pathOptional[Path]: Path to the samples_chrom.txt file within the GFA index directory. Auto-initialized.
dict_gfa_graph_objectDict[str, SubGraph]: Dictionary mapping sample names to their corresponding SubGraph objects after extraction. Defaults to an empty dict.
sample_name_queryOptional[str]: Name of the primary sample for query operations (e.g., subgraph extraction). Defaults to None.
chromosome_queryOptional[str]: Chromosome identifier for query operations. Defaults to None.
start_queryint: Start position for query operations (0-based). Defaults to 0.
stop_queryOptional[int]: Stop position for query operations. If None, might be inferred as chromosome end. Defaults to None.
suffixOptional[str]: Custom suffix for output files. If None, a default suffix based on query parameters is generated. Auto-initialized.
build_fasta_flagbool: Flag to enable FASTA file generation during subgraph extraction. Defaults to False.
gzip_gfabool: Flag indicating if the input GFA file is gzipped. Auto-detected.
mergeOptional[int]: Merge distance (-d for bedtools merge) for BED region processing. If -1 and query region is set, defaults to 10% of query region size. Defaults to None.
metaDict[str, Any]: Dictionary for meta-parameters like verbosity, log_path, threads, passed from CLI or config. Defaults to an empty dict.
index_linksbool: Flag to control whether GFA links are indexed into a database during GFA parsing. Defaults to False.
debugbool: Flag to enable debug mode, typically for more verbose logging or error details. Defaults to False.
index_pathOptional[Path]: Path to the GFA index directory ({gfa_name}_GraTools_INDEX). Auto-initialized.
header_gfa_fileOptional[Path]: Path to the saved GFA header file within the index directory. Auto-initialized.
stats_gfa_fileOptional[Path]: Path to the saved GFA statistics file within the index directory. Auto-initialized.
sub_graph_queryOptional[SubGraph]: SubGraph object for the primary query sample. Initialized in extract_subgraph.
_cached_chromosome_dataOptional[pd.DataFrame]: Internal cache for data read from samples_chrom_path to avoid redundant parsing.
disable_progress_flagOptional[bool]: Flag to control progress bar visibility. Defaults to False.

gfa_path: Path = <dataclasses._MISSING_TYPE object>

threads: int = 1

outdir: Path | None = None

logger: Logger | None = None

gfa_name: str | None = None

bam_segments_file: Path | None = None

dict_samples_chrom: defaultdict = <dataclasses._MISSING_TYPE object>

works_path: Path | None = None

bed_path: Path | None = None

bam_path: Path | None = None

samples_chrom_path: Path | None = None

dict_gfa_graph_object: Dict[str, SubGraph] = <dataclasses._MISSING_TYPE object>

sample_name_query: str | None = None

chromosome_query: str | None = None

start_query: int = 0

stop_query: int | None = None

suffix: str | None = None

build_fasta_flag: bool = False

gzip_gfa: bool = False

merge: int | None = None

meta: Dict[str, Any] = <dataclasses._MISSING_TYPE object>

index_links: bool = False

debug: bool = False

disable_progress_flag: bool = False

index_path: Path | None = None

header_gfa_file: Path | None = None

stats_gfa_file: Path | None = None

sub_graph_query: SubGraph | None = None

get_gfa_statistics_df()[source]

Loads GFA statistics from the pre-computed statistics file into a pandas DataFrame.

Returns:: Optional[pd.DataFrame] – DataFrame with GFA statistics, or None if file not found/readable.
Return type:: DataFrame | None

save_gfa_statistics()[source]

Saves the GFA statistics to a file.

Return type:: None

display_gfa_statistics(by_category=False)[source]

Displays GFA statistics in a Rich Table, either categorized or as a single table.

Parameters:: by_category (bool, optional) – If True, display stats in separate tables per category. If False, display in a single comprehensive table. Defaults to False.
Return type:: None

get_chromosome_size(sample_name, chromosome_name)[source]

Gets the maximum end position (size) of a given chromosome for a specific sample. This represents the extent of the chromosome as defined by walk fragments in the GFA.

Parameters:

sample_name (str) – The name of the sample.
chromosome_name (str) – The name of the chromosome.

Returns:

Optional[int] – The size of the chromosome (max end position of its fragments), or None if the sample/chromosome is not found or has no valid fragments.

Return type:

int | None
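Given the documented layout of dict_samples_chrom (sample name, then chromosome name, then a list of (start_fragment, stop_fragment) string tuples), the size lookup can be sketched as below. The helper name `chromosome_size` and the toy data are hypothetical, and skipping non-numeric stop values is an assumption about what "valid fragments" means.

```python
from collections import OrderedDict, defaultdict
from typing import Optional

# Toy structure mirroring the documented dict_samples_chrom layout.
dict_samples_chrom = defaultdict(OrderedDict)
dict_samples_chrom["sampleA"]["chr1"] = [("0", "1500"), ("1500", "4200")]


def chromosome_size(samples_chrom, sample_name: str, chromosome_name: str) -> Optional[int]:
    """Hypothetical helper: largest fragment end, or None if nothing usable."""
    # .get() avoids creating empty entries in the defaultdict for unknown samples.
    fragments = samples_chrom.get(sample_name, {}).get(chromosome_name, [])
    # Assumption: fragments whose stop value is not a plain integer are ignored.
    ends = [int(stop) for _start, stop in fragments if stop.isdigit()]
    return max(ends) if ends else None


size = chromosome_size(dict_samples_chrom, "sampleA", "chr1")
missing = chromosome_size(dict_samples_chrom, "sampleA", "chrX")
```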

property available_sample_names: List[str]

Retrieves a sorted list of unique sample names present in the GFA data.

Returns:: List[str] – Sorted list of unique sample names.

display_available_sample_names()[source]

Displays available sample names in a Rich Table.

Return type:: None

save_available_sample_names()[source]

Saves the list of available sample names to a CSV file.

Return type:: None

get_chromosomes_summary_by_sample_df()[source]

Generates a DataFrame summarizing chromosomes per sample. Includes sample name, a comma-separated list of unique chromosome names, and the count of unique chromosomes for that sample.

Returns:

Optional[pd.DataFrame] – DataFrame with columns [“SAMPLES”, “CHROMOSOMES_LIST”, “NUM_UNIQUE_CHROMOSOMES”], or None if data cannot be loaded/processed.

Return type:

DataFrame | None

save_chromosomes_summary_by_sample()[source]

Saves the chromosome summary per sample to a CSV file.

Return type:: None

save_full_chromosome_fragment_data()[source]

Saves the raw chromosome fragment data (sample, chrom, start, end) to a CSV file.

Return type:: None

display_chromosomes_summary()[source]

Displays the chromosome summary per sample in a Rich Table.

Return type:: None

display_full_chromosome_fragment_data()[source]

Displays the full chromosome fragment data in a Rich Table, grouped by sample.

Return type:: None

extract_subgraph(samples_list_path=None, all_samples_flag=False)[source]

Extracts a subgraph based on a query region and optionally for other specified samples. Manages SubGraph object creation, processing, and GFA/FASTA file generation.

Parameters:

samples_list_path (Optional[Path]) – Path to a file containing a list of additional samples to process (one per line).
all_samples_flag (bool) – If True and samples_list_path is not given, process all samples found in the GFA (relative to the query region).

Return type:

None

concatenate_and_generate_subgfa_file()[source]

Concatenates GFA components (Header, Segments, Links, Walks) from all processed SubGraph objects and writes them to a combined GFA file (gzipped).

Return type:: None

generate_combined_fasta_file()[source]

Generates a combined FASTA file from sequences collected in all processed SubGraph objects.

Return type:: None

run_core_dispensable_ratio_analysis(input_as_number, shared_min, specific_max, filter_len)[source]

Runs core/dispensable segment ratio analysis using GratoolsBam. Parameters mirror those of GratoolsBam.core_dispensable_ratio.

Parameters:

input_as_number (bool)
shared_min (int)
specific_max (int | None)
filter_len (int)

Return type:

None

run_depth_nodes_statistics(filter_len)[source]

Runs node depth statistics analysis using GratoolsBam.

Parameters:: filter_len (int)
Return type:: None

run_get_specific_groups_sample_analysis(sample_list_a_path, sample_list_b_path, filter_len, output_csv)[source]

Runs get_specific_groups_sample and saves the result to a file.

Parameters:

sample_list_a (list) – List of samples to check for shared segments.
sample_list_b (list) – List of samples to check for specific segments.
filter_len (int, optional) – Minimum length of segments to be considered.
output_csv (bool) – output the segments in a csv file (if False, only print stats)
sample_list_a_path (Path | None)
sample_list_b_path (Path | None)

Return type:

None

find_specific_groups_sample_position(segment_list_shared, sample_list_a=None)[source]

Finds segment positions using a streaming approach to balance RAM and I/O performance. It reads awk’s output line by line without loading the full result into memory or writing intermediate filtered BED files to disk.

get_segments_by_depth(input_as_number, lower_bound, upper_bound, filter_len)[source]

Retrieves segments within a specific depth range, using GratoolsBam. Returns a dictionary of {segment_id: depth}.

Parameters:

input_as_number (bool)
lower_bound (int)
upper_bound (int)
filter_len (int)

Return type:

Dict[str, int]

display_or_save_segments_by_depth(input_as_number, lower_bound, upper_bound, filter_len, output_to_file)[source]

Retrieves segments by depth and either displays them in a Rich Table (if output_to_file is False) or saves them to a CSV file (if output_to_file is True).

Parameters:

output_to_file (bool) – If True, save to CSV. If False, print to terminal.
input_as_number (bool)
lower_bound (int)
upper_bound (int)
filter_len (int)

Return type:

None

export_to_bandage_csv(output_csv_path=None)[source]

Exports node (segment) information to a CSV file compatible with Bandage. This method uses the indexed BAM file to calculate properties like length and depth.

Parameters:: output_csv_path (Optional[Path]) – The path to save the output CSV file. If None, a default path is generated in the GFA directory.
Return type:: None

Parameters:

gfa_path (Path)
threads (int)
outdir (Path | None)
logger (Logger | None)
gfa_name (str | None)
bam_segments_file (Path | None)
dict_samples_chrom (defaultdict)
works_path (Path | None)
bed_path (Path | None)
bam_path (Path | None)
samples_chrom_path (Path | None)
dict_gfa_graph_object (Dict[str, SubGraph])
sample_name_query (str | None)
chromosome_query (str | None)
start_query (int)
stop_query (int | None)
suffix (str | None)
build_fasta_flag (bool)
merge (int | None)
meta (Dict[str, Any])
index_links (bool)
debug (bool)
disable_progress_flag (bool)

—

System & Utilities

Entry point for the Command Line Interface (CLI).

🚀 gratools.main

General purpose helper functions and genomic utilities.

🔧 gratools.useful_function

gratools.useful_function.reverse_complement_string(s)[source]

Quickly computes the reverse complement of a DNA string.

Parameters:: s (str)
Return type:: str
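A common fast way to compute a reverse complement in Python is a translation table followed by a reversed slice. This is a sketch of that technique, not necessarily how GraTools does it; the name `reverse_complement_sketch` is hypothetical, and the handling of N and lowercase bases is an assumption.

```python
# Translation table built once at import time; N and lowercase bases are
# included as an assumption about the expected input alphabet.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(s: str) -> str:
    """Complement every base, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]


rc = reverse_complement_sketch("ATGCn")
```

Characters outside the table are left unchanged by str.translate, so they are only reversed.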

class gratools.useful_function.CustomCommand(*args, **kwargs)[source]

Bases: Command

Custom Command class that applies global context settings (for help formatting) and prepends a specific header to the help message of individual commands.

__init__(*args, **kwargs)[source]

Initializes the CustomCommand.

Ensures that the predefined CONTEXT_SETTINGS are applied by default to this command.

invoke(ctx)[source]: Mesure et affiche le temps d’exécution autour de l’appel réel.

get_help(ctx)[source]

Overrides the default help generation to prepend a custom header.

The header is printed directly to the console using shared_console before the standard help text is generated and returned.

Parameters:: ctx (click.Context) – The current Click context.
Returns:: str – The formatted help text, including the prepended header.
Return type:: str

class gratools.useful_function.CustomGroup(*args, **kwargs)[source]

Bases: Group

Custom Group class that applies global context settings, prepends a header to its own help message, and ensures that all subcommands added via its command decorator use CustomCommand by default.

__init__(*args, **kwargs)[source]

Initializes the CustomGroup.

Ensures that the predefined CONTEXT_SETTINGS are applied by default to this group and its subcommands (if they don’t override).

command(*args, **kwargs)[source]

Overrides the default command decorator registration.

This ensures that any command registered using this group’s command method will automatically use CustomCommand as its class, thereby inheriting the custom help formatting and header.

Parameters:

*args – Positional arguments for the command decorator.
**kwargs – Keyword arguments for the command decorator.

Returns:

Callable – The decorator that registers the command.

get_help(ctx)[source]

Overrides the default help generation for the group to prepend a custom header.

Parameters:: ctx (click.Context) – The current Click context.
Returns:: str – The formatted help text for the group, including the prepended header.
Return type:: str

gratools.useful_function.validate_percentage_or_int(ctx, param, value)[source]

Click callback for validating that an option’s value is either an integer or a float representing a percentage (between 0.0 and 1.0 inclusive).

This function is intended to be used as a callback for a Click option.

Parameters:

ctx (click.Context) – The current Click context.
param (click.Parameter) – The Click parameter (option) being validated.
value (str | int | float) – The input value provided by the user for the option. Click might pass it as str, or already converted if type is set.

Returns:

int | float – The validated value, converted to int if it’s a whole number, or float if it’s a percentage (0.0-1.0).

Raises:

click.BadParameter – If the value is not a valid integer or a float between 0.0 and 1.0.

Return type:

int | float
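The validation rule can be sketched without Click. The function below is a hypothetical stand-in: it raises ValueError where the real callback raises click.BadParameter, and it drops the ctx and param arguments. Treating any text that parses as an int (including negative values) as valid is an assumption, since the documentation does not state further limits.

```python
from typing import Union


def parse_percentage_or_int(value: Union[str, int, float]) -> Union[int, float]:
    """Hypothetical stand-in for the Click callback (ValueError, not BadParameter)."""
    text = str(value).strip()
    # First try a plain integer, e.g. "5".
    try:
        return int(text)
    except ValueError:
        pass
    # Otherwise the value must be a float in the closed range 0.0-1.0.
    try:
        number = float(text)
    except ValueError:
        raise ValueError(f"{value!r} is neither an integer nor a float") from None
    if not 0.0 <= number <= 1.0:
        raise ValueError(f"{value!r} is a float outside the range 0.0-1.0")
    return number


as_count = parse_percentage_or_int("5")
as_fraction = parse_percentage_or_int("0.25")
```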

Configuration for the specialized logging system (Rich-based).

📜 gratools.logger_config

class gratools.logger_config.ThreadedRichHandler(*args, **kwargs)[source]

Bases: RichHandler

Custom RichHandler that processes log records in a separate thread to prevent blocking the main application thread, especially during I/O-bound logging operations (like writing to console or complex formatting).

It also includes a feature to trigger a program exit if a log record of ERROR level or higher is emitted through it.

Attributes

runningbool: A flag indicating whether the log processing worker thread should continue running.
log_queuequeue.Queue[logging.LogRecord]: A thread-safe queue used to buffer log records before they are processed by the worker thread.
worker_threadthreading.Thread: The background thread responsible for consuming log records from log_queue.
critical_error_occurredbool: A flag set to True if an ERROR or CRITICAL log has been processed, leading to program termination.

__init__(*args, **kwargs)[source]

Initializes the ThreadedRichHandler.

Sets up the log queue and starts the background worker thread.

Return type:: None

emit(record)[source]

Queues a log record for processing by the worker thread.

If the log record’s level is ERROR or higher, this method sets a flag to indicate a critical error and initiates the process to stop the log processing and exit the application.

Parameters

recordlogging.LogRecord: The log record to be emitted.

Parameters:: record (LogRecord)
Return type:: None

stop_processing()[source]

Signals the worker thread to stop and waits for it to terminate.

This ensures that the queue is flushed of pending log records before the thread exits.

Return type:: None

close()[source]

Closes the handler, ensuring the worker thread is stopped and resources are released.

Return type:: None
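The queue-plus-worker pattern described for this handler can be sketched with the standard library alone. The class below is illustrative only: `QueuedHandler` is a hypothetical name, it wraps an arbitrary target handler instead of subclassing RichHandler, and it deliberately omits the exit-on-ERROR behaviour.

```python
import io
import logging
import queue
import threading


class QueuedHandler(logging.Handler):
    """Stdlib-only sketch: records are handled on a background thread."""

    def __init__(self, target: logging.Handler) -> None:
        super().__init__()
        self.target = target
        self.log_queue: "queue.Queue" = queue.Queue()
        self.running = True
        self.worker_thread = threading.Thread(target=self._worker, daemon=True)
        self.worker_thread.start()

    def emit(self, record: logging.LogRecord) -> None:
        # Only enqueue here; formatting and I/O happen on the worker thread.
        self.log_queue.put(record)

    def _worker(self) -> None:
        # Keep going until asked to stop AND the queue has been drained.
        while self.running or not self.log_queue.empty():
            try:
                record = self.log_queue.get(timeout=0.05)
            except queue.Empty:
                continue
            self.target.handle(record)

    def stop_processing(self) -> None:
        self.running = False
        self.worker_thread.join()

    def close(self) -> None:
        self.stop_processing()
        super().close()


stream = io.StringIO()
handler = QueuedHandler(logging.StreamHandler(stream))
logger = logging.getLogger("queued_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.info("indexing started")
handler.close()  # joins the worker, so pending records are flushed first
logger.removeHandler(handler)
```

The standard library also ships logging.handlers.QueueHandler and QueueListener, which implement the same idea and may be preferable outside of a sketch.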

gratools.logger_config.configure_logger(name, log_dir_path, verbosity_level, file_suffix='')[source]

Configures and returns a logger instance with specified settings.

The logger will output to:

1. The console (via RichHandler for formatted, colorful output).
2. A general log file (e.g., ‘name_log.o’).
3. An error log file for WARNING and higher messages (e.g., ‘name_log.e’).

Parameters

namestr: The name for the logger (e.g., “GraTools”).
log_dir_pathPath: The directory path where log files will be stored.
verbosity_levelstr: The logging verbosity level (e.g., “DEBUG”, “INFO”, “ERROR”). This sets the minimum level for messages to be processed by the logger.
file_suffixstr, optional: An optional suffix to append to the base name of log files. Defaults to an empty string.

Returns

logging.Logger: The configured logger instance.

Parameters:

name (str)
log_dir_path (Path)
verbosity_level (str)
file_suffix (str)

Return type:

Logger
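The three outputs listed above can be approximated with standard-library handlers. This is a sketch under assumptions, not the GraTools code: `configure_logger_sketch` is a hypothetical name, a plain StreamHandler stands in for RichHandler, and the placement of file_suffix inside the file names is a guess.

```python
import logging
import tempfile
from pathlib import Path


def configure_logger_sketch(
    name: str, log_dir_path: Path, verbosity_level: str, file_suffix: str = ""
) -> logging.Logger:
    """Hypothetical stdlib approximation of the documented logger setup."""
    log_dir_path.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(verbosity_level)  # accepts level names such as "INFO"
    console = logging.StreamHandler()  # stand-in for RichHandler
    # Assumption: the suffix is appended to the base name, before the extension.
    general = logging.FileHandler(log_dir_path / f"{name}_log{file_suffix}.o")
    errors = logging.FileHandler(log_dir_path / f"{name}_log{file_suffix}.e")
    errors.setLevel(logging.WARNING)  # error file: WARNING and higher only
    for handler in (console, general, errors):
        logger.addHandler(handler)
    return logger


log_dir = Path(tempfile.mkdtemp())
logger = configure_logger_sketch("GraToolsSketch", log_dir, "INFO")
logger.info("index built")
logger.warning("segment without walk")
# Close the handlers so the files are flushed before reading them back.
for handler in list(logger.handlers):
    handler.close()
    logger.removeHandler(handler)
general_text = (log_dir / "GraToolsSketch_log.o").read_text()
error_text = (log_dir / "GraToolsSketch_log.e").read_text()
```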

gratools.logger_config.update_logger_file_suffix(logger, new_file_suffix)[source]

Updates file handlers of a logger to use a new file suffix. Existing log content from the old file (if any) is copied to the new file location before switching. The old file is then deleted.

This is useful if, for instance, a specific operation (like a query) should have its logs in a uniquely named file determined mid-execution.

Parameters

loggerlogging.Logger: The logger instance whose file handlers need updating.
new_file_suffixstr: The new suffix to be incorporated into the log filenames. Example: if the old file was “main_log.log” and new_file_suffix is “_query123”, the new file becomes “main_log_query123.log”.

Returns

logging.Logger: The same logger instance, now with updated file handlers.

Parameters:

logger (Logger)
new_file_suffix (str)

Return type:

Logger
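The renaming rule from the example (suffix inserted between the base name and the extension) can be expressed with pathlib. The helper `with_file_suffix` is hypothetical and covers only the path computation, not the copying of old log content or the handler swap.

```python
from pathlib import Path


def with_file_suffix(log_file: Path, new_file_suffix: str) -> Path:
    """Insert the suffix between the file's stem and its extension."""
    return log_file.with_name(f"{log_file.stem}{new_file_suffix}{log_file.suffix}")


renamed = with_file_suffix(Path("logs/main_log.log"), "_query123")
```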

—

Package Contents

📦 gratools (Init)

Main package module for gratools.

Contains core classes and functions for GraTools.

class gratools.AsyncGfaDatabase(db_file, timeout=30.0)[source]

Bases: object

Manage an asynchronous SQLite database for storing and querying GFA link data. It uses aiosqlite for non-blocking operations within an asyncio event loop and serializes writes via an internal FIFO queue to prevent SQLite lock contention.

Attributes

db_filePath: Path to the SQLite database file.
timeoutfloat: Maximum timeout (in seconds) for SQLite lock acquisition.
loggerlogging.Logger: Logger instance.
_connOptional[aiosqlite.Connection]: Shared SQLite connection (or None if not connected).
_write_queueasyncio.Queue: Asynchronous queue for batches of links to be inserted. Max size 100.
_sql_taskOptional[asyncio.Task]: Background task consuming the queue and writing to the database.
_shutdownbool: Flag to signal shutdown to the writer task.

__init__(db_file, timeout=30.0)[source]

Initialize the instance without opening the connection. The schema is created upon the first call to connect().

Parameters:

db_file (Path) – Path to the SQLite file to be used as backend.
timeout (float, optional) – Maximum wait time for SQLite locks (default 30.0s).

async connect()[source]

Connect to the SQLite db (if not already connected), configure PRAGMA settings, create the ‘links’ table schema (if it doesn’t exist), and start the SQL writer task. This method is idempotent: if already connected, it does nothing.

Return type:: None

async batch_insert_links(links)[source]

Enqueue a batch of links for non-blocking insertion. Must be called after await connect().

Parameters:: links (List[Tuple[str, int, int, str, int, int]]) – List of tuples, each representing a link: (seg_id_1, orient_seg_1, orient_key_seg_1, seg_id_2, orient_seg_2, orient_key_seg_2).
Return type:: None
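The write-serialization idea (producers enqueue batches, a single background task performs the inserts) can be sketched as follows. This is an illustration only: it uses the blocking sqlite3 module in place of aiosqlite, a None sentinel in place of the _shutdown flag, and column names assumed from the link tuple layout. The function names `writer` and `main` are hypothetical.

```python
import asyncio
import sqlite3
from typing import List, Optional, Tuple

Link = Tuple[str, int, int, str, int, int]


async def writer(conn: sqlite3.Connection, write_queue: "asyncio.Queue") -> None:
    """Single consumer: the only place where rows are written."""
    while True:
        batch: Optional[List[Link]] = await write_queue.get()
        if batch is None:  # sentinel: no more batches will arrive
            break
        conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?, ?, ?)", batch)
        conn.commit()


async def main() -> int:
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one column per field of the documented link tuple.
    conn.execute(
        "CREATE TABLE links (seg_id_1 TEXT, orient_seg_1 INTEGER, orient_key_seg_1 INTEGER,"
        " seg_id_2 TEXT, orient_seg_2 INTEGER, orient_key_seg_2 INTEGER)"
    )
    write_queue: "asyncio.Queue" = asyncio.Queue(maxsize=100)
    task = asyncio.create_task(writer(conn, write_queue))
    # Producers only enqueue; put() waits if the queue is full (back-pressure).
    await write_queue.put([("s1", 1, 1, "s2", 1, 1), ("s2", 1, 1, "s3", -1, -1)])
    await write_queue.put([("s3", -1, -1, "s4", 1, 1)])
    await write_queue.put(None)
    await task
    (count,) = conn.execute("SELECT COUNT(*) FROM links").fetchone()
    conn.close()
    return count


inserted = asyncio.run(main())
```

Since only the writer task touches the connection for inserts, batches are applied in FIFO order and never compete for the SQLite write lock.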

async create_indexes()[source]

Create indexes on seg_id_1 and seg_id_2 to accelerate queries. Should ideally be called after all data insertions are complete.

Return type:: None

async query_links_by_segment(segment_id)[source]

Retrieves all links where segment_id appears as seg_id_1 or seg_id_2.

Parameters:: segment_id (str) – The ID of the target segment.
Returns:: List of tuples, each representing a full link row from the database.
Return type:: List[Tuple[Any, …]]

async test_query_links(segment_id)[source]

Retrieve and categorize links related to a given segment:

- “before”: links where segment_id is seg_id_2.
- “after”: links where segment_id is seg_id_1.

Parameters:: segment_id (str) – The segment to analyze.
Returns:: List of tuples – (connected_segment_id, position_type, orient_seg_1, orient_seg_2).
Return type:: List[Tuple[str, str, int, int]]

async find_children_and_grandchildren(node_id)[source]

Find direct successors (children) and second-degree successors (grandchildren) of a segment. A child is seg_id_2 where node_id is seg_id_1. A grandchild is a child of a child.

Parameters:: node_id (str) – The starting segment ID.
Returns:: A dictionary – {“children”: [IDs], “grandchildren”: [IDs]}.
Return type:: Dict[str, List[str]]
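The child and grandchild definitions map onto one lookup and one self-join on the links table. The sketch below uses the synchronous sqlite3 module rather than aiosqlite, and the column names are assumed from the link tuple layout; `children_and_grandchildren` is a hypothetical helper, not the library's method.

```python
import sqlite3
from typing import Dict, List


def children_and_grandchildren(conn: sqlite3.Connection, node_id: str) -> Dict[str, List[str]]:
    """Hypothetical synchronous version of the documented query."""
    # Children: seg_id_2 of every link whose seg_id_1 is the node.
    children = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT seg_id_2 FROM links WHERE seg_id_1 = ? ORDER BY seg_id_2",
            (node_id,),
        )
    ]
    # Grandchildren: children of those children, found with a self-join.
    grandchildren = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT g.seg_id_2 FROM links AS c "
            "JOIN links AS g ON g.seg_id_1 = c.seg_id_2 "
            "WHERE c.seg_id_1 = ? ORDER BY g.seg_id_2",
            (node_id,),
        )
    ]
    return {"children": children, "grandchildren": grandchildren}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE links (seg_id_1 TEXT, orient_seg_1 INTEGER, orient_key_seg_1 INTEGER,"
    " seg_id_2 TEXT, orient_seg_2 INTEGER, orient_key_seg_2 INTEGER)"
)
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("s1", 1, 1, "s2", 1, 1),
        ("s1", 1, 1, "s3", 1, 1),
        ("s2", 1, 1, "s4", 1, 1),
        ("s3", 1, 1, "s4", 1, 1),
    ],
)
result = children_and_grandchildren(conn, "s1")
```

Both lookups filter on seg_id_1, which is why an index on that column (see create_indexes) helps this kind of query.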

async close()[source]

Properly shut down the database:

- Signal the SQL writer task to stop.
- Wait for the writer task to finish processing its queue (with timeout).
- Cancel the task if it doesn’t finish in time.
- Create indexes (important to do this after all writes).
- Close the SQLite connection.

Return type:: None

Parameters:

db_file (Path)
timeout (float)

class gratools.SubGraph(bam_path, bed_path, logger=<factory>, sample_name=None, sample_name_query=None, chromosome_query=None, start_query=None, stop_query=None, offset_first=0, offset_last=0, add_start_bases_first_segment=0, intersect_bed=None, segment_id_set=<factory>, segment_id_first_query=None, segment_id_first_strand=None, segment_id_last_query=None, segment_id_last_strand=None, segment_id_first=None, segment_id_last=None, works_path=None, merge=None, build_fasta_flag=False, gfa_walk_list=<factory>, gfa_link_list=<factory>, gfa_segment_list=<factory>, dict_segments_samples=<factory>, dict_segments_sequence=<factory>, sequences_list=<factory>, progress_dict=None, task_id=None, regions=None, intersected_results_by_regions=None)[source]

Bases: object

A class representing a subgraph in a genomic analysis pipeline. It handles operations like BED file intersection, GFA walk/link/segment building, and FASTA sequence generation for a specific sample or region.

Attributes

bam_pathPath: The file path to the BAM file.
bed_pathPath: The file path to the BED file.
loggerlogging.Logger: The logger instance for logging messages. Defaults to a logger named “GraTools”.
sample_nameOptional[str]: The name of the sample, derived from bed_path. Defaults to None.
sample_name_queryOptional[str]: The name of the sample being queried. Defaults to None.
chromosome_queryOptional[str]: The chromosome name for the query. Defaults to None.
start_queryOptional[int]: The start position on the chromosome for the query. Defaults to None.
stop_queryOptional[int]: The stop position on the chromosome for the query. Defaults to None.
offset_firstint: The offset for the first segment in the query region. Defaults to 0.
offset_lastint: The offset for the last segment in the query region. Defaults to 0.
add_start_bases_first_segmentint: Additional start bases for the first segment. Defaults to 0.
intersect_bedOptional[BedTool]: The BedTool object for intersected BED regions. Defaults to None.
segment_id_setSet[str]: A set storing unique segment IDs encountered during walk building.
segment_id_first_queryOptional[str]: The ID of the first segment in the query region. Defaults to None.
segment_id_first_strandOptional[str]: The strand (‘+’ or ‘-’) of the first segment in the query. Defaults to None.
segment_id_last_queryOptional[str]: The ID of the last segment in the query region. Defaults to None.
segment_id_last_strandOptional[str]: The strand (‘+’ or ‘-’) of the last segment in the query. Defaults to None.
segment_id_firstOptional[str]: The first segment ID encountered for the current sample (might be the same as the query). Defaults to None.
segment_id_lastOptional[str]: The last segment ID encountered for the current sample (might be the same as the query). Defaults to None.
works_pathOptional[Path]: The working directory path. Defaults to None.
mergeOptional[int]: Merge distance parameter for BED operations (e.g., bedtools merge -d). Defaults to None.
build_fasta_flagOptional[bool]: Flag indicating whether FASTA sequences should be built. Defaults to False.
gfa_walk_listList[str]: A list of GFA walk strings (W lines). Defaults to an empty list.
gfa_link_listList[str]: A list of GFA link strings (L lines). Defaults to an empty list.
gfa_segment_listList[str]: A list of GFA segment strings (S lines). Defaults to an empty list.
dict_segments_samplesdefaultdict[str, List[str]]: A dictionary mapping segment IDs to a list of sample identifiers. Defaults to an empty defaultdict.
dict_segments_sequencedefaultdict[str, str]: A dictionary mapping segment IDs to their sequences. Defaults to an empty defaultdict.
sequences_listList[SeqRecord]: A list of Biopython SeqRecord objects for generated FASTA sequences. Defaults to an empty list.
progress_dictOptional[Dict]: A dictionary to track progress, typically for multi-processing. Defaults to None.
task_idOptional[TaskID]: The task ID for progress tracking with rich.progress. Defaults to None.
regionsOptional[List[Dict[str, Any]]]: A list of regions (dictionaries with ‘chromosome’, ‘start’, ‘stop’). Defaults to None.
intersected_results_by_regionsOptional[BedTool]: A BedTool object containing combined intersected results for all regions. Defaults to None.

bam_path: Path = <dataclasses._MISSING_TYPE object>

bed_path: Path = <dataclasses._MISSING_TYPE object>

logger: Logger = <dataclasses._MISSING_TYPE object>

sample_name: str | None = None

sample_name_query: str | None = None

chromosome_query: str | None = None

start_query: int | None = None

stop_query: int | None = None

offset_first: int = 0

offset_last: int = 0

add_start_bases_first_segment: int = 0

intersect_bed: BedTool | None = None

segment_id_set: Set[str] = <dataclasses._MISSING_TYPE object>

segment_id_first_query: str | None = None

segment_id_first_strand: str | None = None

segment_id_last_query: str | None = None

segment_id_last_strand: str | None = None

segment_id_first: str | None = None

segment_id_last: str | None = None

works_path: Path | None = None

merge: int | None = None

build_fasta_flag: bool | None = False

gfa_walk_list: List[str] = <dataclasses._MISSING_TYPE object>

gfa_link_list: List[str] = <dataclasses._MISSING_TYPE object>

gfa_segment_list: List[str] = <dataclasses._MISSING_TYPE object>

dict_segments_samples: defaultdict = <dataclasses._MISSING_TYPE object>

dict_segments_sequence: defaultdict = <dataclasses._MISSING_TYPE object>

sequences_list: List[SeqRecord] = <dataclasses._MISSING_TYPE object>

progress_dict: Dict | None = None

task_id: TaskID | None = None

regions: List[Dict[str, Any]] | None = None

intersected_results_by_regions: BedTool | None = None

compute_intersection()[source]

Compute the intersection of BED regions for each region in self.regions. Store the results in self.intersected_results_by_regions.

Return type:: None

build_walks()[source]

Build GFA walks (W lines) and links (L lines) from the intersected BED regions.

Return type:: None

build_segments()[source]

Build GFA segments (S lines) from the BAM file for segments in self.segment_id_set.

Return type:: None

filter_bed_with_awk()[source]

Filter a BED file using an awk command line to extract only lines containing an ID of interest.

Return type:: BedTool

get_chr_pos(progress_dict, task_id)[source]

Identify chromosomal regions corresponding to segments in self.segment_id_set, then compute intersections, and finally build walks and segments for these regions.

Parameters

progress_dictOptional[Dict]: A dictionary to track progress for multiprocessing.
task_idOptional[TaskID]: The task ID for rich.progress tracking.

Parameters:

progress_dict (Dict | None)
task_id (TaskID | None)

Return type:

None

build_fasta()[source]

Build FASTA sequences from the GFA walks (self.gfa_walk_list).

This method recovers FASTA sequences by processing each walk, extracting the constitutive segments, and applying specific offsets if the current sample is the query sample (self.sample_name == self.sample_name_query). The offsets trim segment sequences at the beginning or end of the walk to match the precise query region boundaries (self.start_query, self.stop_query). The resulting sequences are stored as SeqRecord objects in self.sequences_list.

Return type:: None
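The trimming step can be pictured with plain strings. This sketch makes several assumptions that the documentation does not confirm: a walk is represented as a list of (segment_id, strand) pairs, segments on the ‘-’ strand are reverse-complemented, offset_first is the number of bases dropped at the start of the walk, and offset_last is the number dropped at the end. The name `walk_sequence` is hypothetical.

```python
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def walk_sequence(
    walk: List[Tuple[str, str]],
    sequences: Dict[str, str],
    offset_first: int = 0,
    offset_last: int = 0,
) -> str:
    """Hypothetical helper: join oriented segments, then trim both ends."""
    parts = []
    for seg_id, strand in walk:
        seq = sequences[seg_id]
        # Assumption: '-' means the segment is traversed as its reverse complement.
        parts.append(seq.translate(_COMPLEMENT)[::-1] if strand == "-" else seq)
    full = "".join(parts)
    # Assumption: offsets count the bases to drop at each end of the walk.
    return full[offset_first : len(full) - offset_last]


sequences = {"s1": "AACC", "s2": "GGT"}
walk = [("s1", "+"), ("s2", "-")]
untrimmed = walk_sequence(walk, sequences)
trimmed = walk_sequence(walk, sequences, offset_first=1, offset_last=2)
```

In the real method the offsets apply only when the sample is the query sample; other samples keep their untrimmed walk sequence.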

Parameters:

bam_path (Path)
bed_path (Path)
logger (Logger)
sample_name (str | None)
sample_name_query (str | None)
chromosome_query (str | None)
start_query (int | None)
stop_query (int | None)
offset_first (int)
offset_last (int)
add_start_bases_first_segment (int)
intersect_bed (BedTool | None)
segment_id_set (Set[str])
segment_id_first_query (str | None)
segment_id_first_strand (str | None)
segment_id_last_query (str | None)
segment_id_last_strand (str | None)
segment_id_first (str | None)
segment_id_last (str | None)
works_path (Path | None)
merge (int | None)
build_fasta_flag (bool | None)
gfa_walk_list (List[str])
gfa_link_list (List[str])
gfa_segment_list (List[str])
dict_segments_samples (defaultdict)
dict_segments_sequence (defaultdict)
sequences_list (List[SeqRecord])
progress_dict (Dict | None)
task_id (TaskID | None)
regions (List[Dict[str, Any]] | None)
intersected_results_by_regions (BedTool | None)

class gratools.GFA(gfa_path, threads=1, logger=<factory>, disable_progress_flag=False, gfa_name=None, version=None, header_gfa=<factory>, sample_reference=None, bam_segments_file=None, dict_samples_chrom=<factory>, dict_segments_size=<factory>, dict_segments_samples=<factory>, dict_samples_bed=<factory>, works_path=None, bed_path=None, bam_path=None, found_minigraph=False, index_links=False, db_links=None, segment_count=0, total_segment_length=0, link_count=0, degrees=<factory>, walks_count=0, max_walk_rank=0, sum_rank0_length=0, input_genome_size=0, walks_info=<factory>, inverted_links_count=0, negative_links_count=0, self_links_count=0, isolated_segments=<factory>, shared_executor=None, progress=None, line_type_counts=<factory>, header_gfa_file=None, stats_file=None, db_file_path=None)[source]

Bases: object

Manages parsing of a GFA (Graphical Fragment Assembly) file, computes statistics, and generates related files, such as a BAM file containing the segments and per-sample BED files describing the paths. Also feeds an asynchronous database for links and an asynchronous BED writer.

Attributes

gfa_pathPath: Path to the input GFA file (can be .gfa or .gfa.gz).
threadsint, optional: Number of threads for operations like BAM file processing. Default is 1.
loggerlogging.Logger: Logger object. Default is a logger named “GraTools”.
gfa_nameOptional[str]: Name of the GFA file derived from gfa_path (without extensions). Auto-initialized.
versionOptional[str]: GFA version extracted from the header (e.g., “1.0”). Auto-initialized.
header_gfaList[str]: List of header lines (H lines) from the GFA file. Auto-initialized.
sample_referenceOptional[str]: Reference sample name, potentially from GFA header (RS tag). Auto-initialized.
bam_segments_fileOptional[Path]: Path to the BAM file where segments (S lines) will be written. Auto-initialized.
dict_samples_chromdefaultdict[str, OrderedDict[str, List[str]]]: Maps sample names to an OrderedDict of chromosome names, which in turn map to a list of “start\tstop” (tab-separated) fragment strings derived from Walk (W) lines. Auto-initialized.
dict_segments_sizedefaultdict[str, int]: Maps segment IDs to their length (in base pairs). Auto-initialized.
dict_segments_samplesdefaultdict[str, List[str]]: Maps segment IDs to a list of sample identifiers (“sample;chromosome;haplotype”) that contain the segment. Auto-initialized.
dict_samples_beddefaultdict[str, OrderedDict[str, Path]]: Not directly populated by the current parsing logic. It was likely intended to track the paths of generated BED files, which AsyncBedWriter manages internally, so this attribute may be redundant or used only for post-processing tracking. Auto-initialized.
works_pathOptional[Path]: Path to the working directory (e.g., “…/{gfa_name}_GraTools_INDEX”). Auto-initialized.
bed_pathOptional[Path]: Path to the subdirectory for BED files within works_path. Auto-initialized.
bam_pathOptional[Path]: Path to the subdirectory for BAM files within works_path. Auto-initialized.
found_minigraphbool: Flag indicating if a sample named ‘MINIGRAPH’ (case-insensitive) was found in Walk lines. Defaults to False.
index_linksbool, optional: If True, GFA links (L lines) are stored in an SQLite database. Defaults to False.
db_linksOptional[AsyncGfaDatabase]: Asynchronous database handler for GFA links. Auto-initialized if index_links is True.
segment_countint: Total number of segments (S lines) processed. Defaults to 0.
total_segment_lengthint: Sum of lengths of all segments. Defaults to 0.
link_countint: Total number of links (L lines) processed. Defaults to 0.
degreesdefaultdict[str, int]: Maps segment IDs to their degree (number of links connected). Defaults to an empty defaultdict.
walks_countint: Total number of walks (W lines) processed. Defaults to 0.
max_walk_rankint: Maximum number of segments in any single walk. Defaults to 0.
sum_rank0_lengthint: Sum of lengths of the first segments of all walks. Defaults to 0.
input_genome_sizeint: Cumulative size of all paths (sum of segment lengths along each walk). Defaults to 0.
walks_infoList[Dict[str, Any]]: List of dictionaries, each containing info for a walk (Path name, Sequence length, Num Segments). Defaults to an empty list.
inverted_links_countint: Count of links where orientations differ (e.g., S1+ -> S2-). Defaults to 0.
negative_links_countint: Count of links where both segments have negative orientation (S1- -> S2-). Defaults to 0.
self_links_countint: Count of links where a segment links to itself (S1 -> S1). Defaults to 0.
isolated_segmentsSet[str]: Set of segment IDs that have no links connected to them. Initialized with all segments, then linked ones removed. Defaults to an empty set.
shared_executorOptional[ThreadPoolExecutor]: Executor for running synchronous tasks in threads. Auto-initialized.
progressOptional[Progress]: Rich Progress instance for displaying progress. Auto-initialized.
line_type_countsCounter: Counts of each GFA line type (H, S, L, W, P, C, E, U). Auto-initialized.
header_gfa_fileOptional[Path]: Path to where the GFA header is saved. Auto-initialized.
stats_fileOptional[Path]: Path to where GFA statistics are saved. Auto-initialized.
db_file_pathOptional[Path]: Path to the SQLite database file for links. Auto-initialized.
RE_ORIENTED_SEG_GT_LTre.Pattern: Compiled regular expression for parsing oriented segments using ‘>’ and ‘<’. Auto-initialized.
RE_ORIENTED_SEG_PLUS_MINUSre.Pattern: Compiled regular expression for parsing oriented segments using ‘+’ and ‘-’. Auto-initialized.
disable_progress_flag: bool: If True, progress bars are disabled. Defaults to False.
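To make the two orientation notations concrete, the sketch below parses a walk written with ‘>’ and ‘<’ and a path written with ‘+’ and ‘-’. The patterns shown are illustrative assumptions; the actual expressions compiled in RE_ORIENTED_SEG_GT_LT and RE_ORIENTED_SEG_PLUS_MINUS are not shown on this page and may differ.

```python
import re
from typing import List, Tuple

# Hypothetical patterns, written for this example only.
GT_LT = re.compile(r"([><])([^><]+)")
PLUS_MINUS = re.compile(r"([^,]+)([+-])(?:,|$)")


def parse_gt_lt(walk: str) -> List[Tuple[str, str]]:
    """Parse '>s1<s2' style walks into (segment_id, strand) pairs."""
    return [(seg, "+" if sign == ">" else "-") for sign, seg in GT_LT.findall(walk)]


def parse_plus_minus(path: str) -> List[Tuple[str, str]]:
    """Parse 's1+,s2-' style paths into (segment_id, strand) pairs."""
    return [(seg, sign) for seg, sign in PLUS_MINUS.findall(path)]


from_walk = parse_gt_lt(">s1<s2>s3")
from_path = parse_plus_minus("s1+,s2-,s3+")
```

Both notations normalize to the same list of pairs, so downstream code only has to handle one representation.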

gfa_path: Path = <dataclasses._MISSING_TYPE object>

threads: int = 1

logger: Logger = <dataclasses._MISSING_TYPE object>

disable_progress_flag: bool | None = False

gfa_name: str | None = None

version: str | None = None

header_gfa: List[str] = <dataclasses._MISSING_TYPE object>

sample_reference: str | None = None

bam_segments_file: Path | None = None

processed_sample_names: Set[str] = <dataclasses._MISSING_TYPE object>

samples_chrom_file_writer: TextIO | None = None

samples_chrom_file_path: Path | None = None

dict_samples_chrom: defaultdict = <dataclasses._MISSING_TYPE object>

dict_segments_size: defaultdict = <dataclasses._MISSING_TYPE object>

dict_segments_samples: defaultdict = <dataclasses._MISSING_TYPE object>

dict_samples_bed: defaultdict = <dataclasses._MISSING_TYPE object>

works_path: Path | None = None

bed_path: Path | None = None

bam_path: Path | None = None

found_minigraph: bool = False

index_links: bool = False

db_links: AsyncGfaDatabase | None = None

segment_count: int = 0

total_segment_length: int = 0

link_count: int = 0

degrees: defaultdict = <dataclasses._MISSING_TYPE object>

walks_count: int = 0

max_walk_rank: int = 0

sum_rank0_length: int = 0

input_genome_size: int = 0

walks_info: List[Dict[str, Any]] = <dataclasses._MISSING_TYPE object>

inverted_links_count: int = 0

negative_links_count: int = 0

self_links_count: int = 0

isolated_segments: set = <dataclasses._MISSING_TYPE object>

shared_executor: ThreadPoolExecutor | None = None

progress: Progress | None = None

line_type_counts: Counter = <dataclasses._MISSING_TYPE object>

header_gfa_file: Path | None = None

stats_file: Path | None = None

db_file_path: Path | None = None

RE_ORIENTED_SEG_GT_LT: Pattern = <dataclasses._MISSING_TYPE object>

RE_ORIENTED_SEG_PLUS_MINUS: Pattern = <dataclasses._MISSING_TYPE object>

save_header()[source]

Save the GFA header lines (H lines) to a text file.

Return type:: None

sort_file_in_place()[source]: Sort a file in place using Unix commands without creating a temporary copy. The file must be in the format: sample chromosome start end

async parse_gfa()[source]

Parses the GFA file line by line, processing headers, segments, links, and walks. Segments are written to a BAM file. Links are (optionally) stored in an async SQLite DB. Walk information is used to generate data for BED files, written by AsyncBedWriter.

Return type:: None

run()[source]: Synchronous entry point to orchestrate GFA parsing and BED file sorting. Sets up an asyncio event loop and runs the asynchronous parse_gfa method. Then, sorts the generated BED files using a ThreadPoolExecutor.
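The orchestration pattern described for run() can be sketched as follows: a synchronous function drives an asynchronous step with asyncio.run, then sorts the resulting files in a thread pool. The functions fake_parse, sort_bed and run_sketch are hypothetical stand-ins, and the sort is done in Python here, whereas the text above says GraTools sorts with Unix commands.

```python
import asyncio
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


async def fake_parse(out_dir: Path) -> List[Path]:
    """Stand-in for the async parsing step: writes unsorted BED-like files."""
    contents = {
        "sampleA.bed": "chr1\t30\t40\ts3\nchr1\t0\t10\ts1\n",
        "sampleB.bed": "chr2\t5\t9\ts7\nchr1\t2\t4\ts2\n",
    }
    paths = []
    for name, text in contents.items():
        await asyncio.sleep(0)  # yield to the event loop, as real async I/O would
        path = out_dir / name
        path.write_text(text)
        paths.append(path)
    return paths


def sort_bed(path: Path) -> Path:
    """Sort one file by chromosome name, then by numeric start position."""
    rows = [line.split("\t") for line in path.read_text().splitlines() if line]
    rows.sort(key=lambda row: (row[0], int(row[1])))
    path.write_text("".join("\t".join(row) + "\n" for row in rows))
    return path


def run_sketch(out_dir: Path, threads: int = 2) -> List[Path]:
    """Synchronous entry point: async parse first, threaded sorting second."""
    paths = asyncio.run(fake_parse(out_dir))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(sort_bed, paths))


with tempfile.TemporaryDirectory() as tmp:
    sorted_paths = run_sketch(Path(tmp))
    sorted_text = {path.name: path.read_text() for path in sorted_paths}
```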

tag_bam()[source]

Tags the generated BAM segments file with sample walk information (SW tag). This uses the GratoolsBam class to perform the tagging. The original BAM segments file is overwritten with the tagged version.

Return type:: None

compute_statistics()[source]

Orchestrates the computation of GFA graph statistics in parallel and saves the results. This method acts as a dispatcher, running heavy calculations in separate threads.

Return type:: dict[str, any]
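The link counters documented in the attribute list of this class (degrees, inverted_links_count, negative_links_count, self_links_count, isolated_segments) can be illustrated with a small single-pass sketch. The function name, the tuple layout of a link, and the choice to count a self-link twice in the degree are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[str, str, str, str]  # (seg_id_1, orient_1, seg_id_2, orient_2)


def link_statistics(segment_ids: Iterable[str], links: Iterable[Link]) -> Dict[str, object]:
    """Classify links and track degrees and isolated segments in one pass."""
    degrees: Dict[str, int] = defaultdict(int)
    isolated: Set[str] = set(segment_ids)  # start with all, remove linked ones
    inverted = negative = self_links = 0
    for seg1, orient1, seg2, orient2 in links:
        degrees[seg1] += 1
        degrees[seg2] += 1
        isolated.discard(seg1)
        isolated.discard(seg2)
        if orient1 != orient2:
            inverted += 1      # e.g. S1+ -> S2-
        elif orient1 == "-":
            negative += 1      # S1- -> S2-
        if seg1 == seg2:
            self_links += 1    # S1 -> S1
    return {
        "degrees": dict(degrees),
        "inverted_links_count": inverted,
        "negative_links_count": negative,
        "self_links_count": self_links,
        "isolated_segments": isolated,
    }


links = [("s1", "+", "s2", "+"), ("s2", "+", "s3", "-"), ("s3", "-", "s1", "-"), ("s2", "+", "s2", "+")]
stats = link_statistics(["s1", "s2", "s3", "s4"], links)
```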

Parameters:

gfa_path (Path)
threads (int)
logger (Logger)
disable_progress_flag (bool | None)
gfa_name (str | None)
version (str | None)
header_gfa (List[str])
sample_reference (str | None)
bam_segments_file (Path | None)
dict_samples_chrom (defaultdict)
dict_segments_size (defaultdict)
dict_segments_samples (defaultdict)
dict_samples_bed (defaultdict)
works_path (Path | None)
bed_path (Path | None)
bam_path (Path | None)
found_minigraph (bool)
index_links (bool)
db_links (AsyncGfaDatabase | None)
segment_count (int)
total_segment_length (int)
link_count (int)
degrees (defaultdict)
walks_count (int)
max_walk_rank (int)
sum_rank0_length (int)
input_genome_size (int)
walks_info (List[Dict[str, Any]])
inverted_links_count (int)
negative_links_count (int)
self_links_count (int)
isolated_segments (set)
shared_executor (ThreadPoolExecutor | None)
progress (Progress | None)
line_type_counts (Counter)
header_gfa_file (Path | None)
stats_file (Path | None)
db_file_path (Path | None)

class gratools.GratoolsBam(bam_path, threads=1, logger=<factory>, suffix=None, works_path=None, gfa_name=None, tagging=False, disable_progress_flag=False)[source]

Bases: object

Handles operations related to BAM files in the GraTools context, such as indexing, extracting segment information, tagging segments, and performing various analyses (core/dispensable ratio, depth statistics, etc.).

Attributes

bam_pathPath: Path to the BAM file.
threadsint, optional: Number of threads for BAM operations (e.g., reading, indexing). Defaults to 1.
loggerlogging.Logger: Logger instance. Defaults to a logger named “GraTools”.
suffixOptional[str], optional: Suffix to append to output filenames generated by analyses. Defaults to None.
works_pathOptional[Path], optional: Working directory path for saving output files. Defaults to None (uses BAM parent dir).
gfa_nameOptional[str], optional: Name of the associated GFA file (used for naming output files). Defaults to None.
taggingbool, optional: If True, indicates that operations might modify tags, potentially requiring re-indexing. Used by index_bam to decide if indexing is needed. Defaults to False.
progressOptional[Progress]: Rich Progress instance for displaying progress. Auto-initialized.
disable_progress_flag Optional[bool], optional: Flag to disable progress bar. Defaults to False.

bam_path: Path = <dataclasses._MISSING_TYPE object>

threads: int = 1

logger: Logger = <dataclasses._MISSING_TYPE object>

suffix: str | None = None

works_path: Path | None = None

gfa_name: str | None = None

tagging: bool = False

disable_progress_flag: bool = False

progress: Progress | None = <dataclasses._MISSING_TYPE object>

index_bam()[source]

Indexes the BAM file using pysam.index if the index is missing or outdated. The index file will have a ‘.bai’ or ‘.crai’ extension depending on the file format (BAM or CRAM).

Return type:: None

build_segments(list_segments=None)[source]

Extracts specified segments from a BAM file and reconstructs their GFA S-line representation. Also populates dictionaries for segment samples and sequences.

Parameters

list_segmentsOptional[List[str]], optional: A list of segment IDs (query_name in BAM) to extract. If None or empty, this method might process all segments or return empty results, depending on intended behavior (current pysam.view call implies it needs a list).

Returns

Tuple[List[str], defaultdict[str, List[str]], defaultdict[str, str]]

gfa_s_lines_list: List of strings, each a GFA S-line.
dict_seg_samples: defaultdict mapping segment ID to list of “sample;chrom;haplo” strings from SW tag.
dict_seg_sequence: defaultdict mapping segment ID to its sequence.

Parameters:: list_segments (List[str] | None)
Return type:: Tuple[List[str], defaultdict, defaultdict]

tag(dict_segments_samples, nb_segments)[source]

Adds or updates the ‘SW’ (Sample Walks) tag to segments in the BAM file. This version uses an integer-to-string mapping for walk paths to improve performance.

The SW tag stores a comma-separated list of “sample;chromosome;haplotype” strings indicating which walks/paths contain the segment. The original BAM file is overwritten with the tagged version.

Parameters

dict_segments_samplesDict[str, List[int]]: Dictionary mapping segment IDs (query_name) to a list of integer IDs representing the walk paths.
nb_segmentsint: The total number of segments in the BAM file.

Returns

Path: The path to the (now tagged and re-indexed) BAM file.

Raises

FileNotFoundError: If the input BAM file does not exist.
Exception: If errors occur during BAM reading, writing, or renaming.

Parameters:

dict_segments_samples (Dict[str, List[int]])
nb_segments (int)

Return type:

Path
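The integer-to-string mapping mentioned for tag() can be sketched as an interning table: each distinct “sample;chromosome;haplotype” string is stored once and segments hold only small integers until the SW tag is written. The class WalkInterner and the function sw_tag are hypothetical names, and no BAM file is touched in this example.

```python
from typing import Dict, List


class WalkInterner:
    """Assign a stable integer ID to each distinct walk string."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self.names: List[str] = []

    def intern(self, walk: str) -> int:
        if walk not in self._ids:
            self._ids[walk] = len(self.names)
            self.names.append(walk)
        return self._ids[walk]


def sw_tag(walk_ids: List[int], names: List[str]) -> str:
    """Expand integer IDs back into the comma-separated SW tag value."""
    return ",".join(names[i] for i in walk_ids)


interner = WalkInterner()
dict_segments_samples: Dict[str, List[int]] = {}
observations = [
    ("s1", "sampleA;chr1;0"),
    ("s1", "sampleB;chr1;1"),
    ("s2", "sampleB;chr1;1"),
]
for seg_id, walk in observations:
    dict_segments_samples.setdefault(seg_id, []).append(interner.intern(walk))

tags = {seg: sw_tag(ids, interner.names) for seg, ids in dict_segments_samples.items()}
```

Storing integers keeps the per-segment lists compact when the same walk name is shared by many segments.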

core_dispensable_ratio(nb_samples_gfa, input_as_number, shared_min_cutoff=1, specific_max_cutoff=None, filter_min_len=1)[source]

Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in. Saves results to a CSV file.

Parameters

nb_samples_gfaint: Total number of unique samples present in the GFA (used for percentage calculation).
input_as_numberbool: If True, shared_min_cutoff and specific_max_cutoff are treated as absolute counts of samples. If False, they are treated as percentages of nb_samples_gfa.
shared_min_cutoffint, optional: Minimum number/percentage of samples a segment must be in to be considered “shared” (core). Defaults to 1.
specific_max_cutoffOptional[int], optional: Maximum number/percentage of samples a segment can be in to be considered “specific” (dispensable). If None, specific analysis might be skipped or use a default (e.g., 1 if input_as_number). Defaults to None.
filter_min_lenint, optional: Minimum length (bp) for a segment to be included in the filtered analysis. Defaults to 1 (no length filter).

Parameters:

nb_samples_gfa (int)
input_as_number (bool)
shared_min_cutoff (int)
specific_max_cutoff (int | None)
filter_min_len (int)

Return type:

None
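A minimal sketch of the cutoff logic described for core_dispensable_ratio follows. It assumes that percentages are given on a 0 to 100 scale and that both bounds are inclusive; neither detail is stated above, and the function name classify_segments is hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def classify_segments(
    depths: Dict[str, int],
    nb_samples_gfa: int,
    input_as_number: bool,
    shared_min_cutoff: float = 1,
    specific_max_cutoff: Optional[float] = None,
) -> Tuple[Set[str], Set[str]]:
    """Return (shared, specific) segment IDs from per-segment sample counts."""
    if input_as_number:
        shared_min = shared_min_cutoff
        specific_max = specific_max_cutoff
    else:
        # Assumption: percentages are on a 0-100 scale.
        shared_min = nb_samples_gfa * shared_min_cutoff / 100
        specific_max = (
            None if specific_max_cutoff is None
            else nb_samples_gfa * specific_max_cutoff / 100
        )
    shared = {seg for seg, depth in depths.items() if depth >= shared_min}
    specific = (
        set() if specific_max is None
        else {seg for seg, depth in depths.items() if depth <= specific_max}
    )
    return shared, specific


depths = {"s1": 4, "s2": 1, "s3": 2}
shared_pct, specific_pct = classify_segments(depths, 4, False, 100, 25)
shared_abs, specific_abs = classify_segments(depths, 4, True, 2, 1)
```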

depth_nodes_stat(nb_samples_gfa, filter_min_len=1)[source]

Calculates and displays statistics about segment depth (number of unique samples a segment is found in). Outputs results to console and a CSV file.

Parameters

nb_samples_gfaint: Total number of unique samples in the GFA. Provided for context; not used directly in the calculations here.
filter_min_lenint, optional: Minimum length (bp) for a segment to be included in the filtered depth analysis. Defaults to 1 (no effective length filter).

Parameters:

nb_samples_gfa (int)
filter_min_len (int)

Return type:

None
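Segment depth, as defined for depth_nodes_stat, is the number of unique samples a segment is found in. The sketch below derives it from SW tag strings and builds a depth histogram; the function names and the in-memory inputs are assumptions, since the real method reads these values from the BAM file.

```python
from collections import Counter
from typing import Dict


def segment_depth(sw_tag: str) -> int:
    """Count unique sample names in a 'sample;chrom;haplo,...' SW tag value."""
    return len({entry.split(";")[0] for entry in sw_tag.split(",") if entry})


def depth_histogram(sw_by_segment: Dict[str, str], lengths: Dict[str, int], filter_min_len: int = 1) -> Counter:
    """Map each depth to the number of segments having it, after length filtering."""
    histogram: Counter = Counter()
    for seg_id, sw_tag in sw_by_segment.items():
        if lengths[seg_id] >= filter_min_len:
            histogram[segment_depth(sw_tag)] += 1
    return histogram


sw_by_segment = {
    "s1": "A;chr1;0,B;chr1;0",
    "s2": "A;chr1;0,A;chr2;0",
    "s3": "B;chr1;0",
}
lengths = {"s1": 100, "s2": 5, "s3": 50}
histogram = depth_histogram(sw_by_segment, lengths, filter_min_len=10)
```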

get_specific_and_shared_segments(samples_list_A, samples_list_B=None, filter_min_len=None, output_csv=None)[source]

Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.

Parameters

samples_list_AList[str]: A list of sample names. A segment is “shared” if it is present in ALL samples in this list.
samples_list_BOptional[List[str]], optional: An optional second list of sample names. If provided, a segment is “specific” if it is shared by all in samples_list_A AND absent from ALL samples in this list.
filter_min_lenOptional[int], optional: If set, only segments with a length greater than or equal to this value will be considered.
output_csvOptional[bool], optional: If True, the function will return sets of the shared and specific segment IDs.

Returns

Tuple[Set[str], Set[str]]

A set of segment IDs that are shared by all samples in samples_list_A.
A set of segment IDs that are specific to samples_list_A relative to samples_list_B. (This set is a subset of the first one).

Parameters:

samples_list_A (List[str])
samples_list_B (List[str] | None)
filter_min_len (int | None)
output_csv (bool | None)

Return type:

Tuple[Set[str], Set[str]]
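The shared and specific definitions above reduce to two set operations. The following sketch works on an in-memory mapping of segment ID to sample names; the function name is hypothetical, and returning an empty specific set when samples_list_B is omitted is an assumption.

```python
from typing import Dict, List, Optional, Set, Tuple


def shared_and_specific(
    segment_samples: Dict[str, Set[str]],
    samples_list_A: List[str],
    samples_list_B: Optional[List[str]] = None,
) -> Tuple[Set[str], Set[str]]:
    """Shared: present in ALL of A. Specific: shared and absent from ALL of B."""
    group_a = set(samples_list_A)
    shared = {seg for seg, samples in segment_samples.items() if group_a <= samples}
    if samples_list_B is None:
        return shared, set()  # assumption: no second group, no specific set
    group_b = set(samples_list_B)
    specific = {seg for seg in shared if not (segment_samples[seg] & group_b)}
    return shared, specific


segment_samples = {
    "s1": {"A", "B", "C"},
    "s2": {"A", "B"},
    "s3": {"A"},
    "s4": {"C"},
}
shared, specific = shared_and_specific(segment_samples, ["A", "B"], ["C"])
```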

get_segments_and_positions_by_depth(total_gfa_samples, input_as_number, lower_bound_depth, upper_bound_depth, filter_min_len, bed_path=None)[source]

Finds segments within a specific depth range and retrieves their genomic positions from BED files.

This function performs two main steps:

1. Scans the BAM file to identify segments that meet the specified depth and length criteria.

2. For those segments, it efficiently queries the relevant BED files to find their exact genomic coordinates (chromosome, start, end).

Parameters

total_gfa_samplesint: Total number of unique samples in the GFA, used for percentage calculations.
input_as_numberbool: If True, depth bounds are absolute counts; if False, they are percentages of total_gfa_samples.
lower_bound_depthint: Minimum sample depth (count or percentage) for a segment to be included.
upper_bound_depthint: Maximum sample depth (count or percentage) for a segment to be included.
filter_min_lenint: Minimum length in base pairs for a segment to be considered.
bed_pathPath, optional: Path to the directory containing the sample-specific BED files.

Returns

Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]: A tuple containing two dictionaries: 1. segments_with_depth: {segment_id: depth} for all segments matching the criteria. 2. segment_locations: {segment_id: {sample_name: [(chrom, start, end), …]}}

Parameters:

total_gfa_samples (int)
input_as_number (bool)
lower_bound_depth (int)
upper_bound_depth (int)
filter_min_len (int)
bed_path (Path)

Return type:

Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]

export_nodes_to_csv(output_csv_path, core_threshold_percent=0.95)[source]

Exports enhanced node information to a CSV file using an efficient, single-pass approach.

This method aggregates data in memory before creating a final Pandas DataFrame, making it much more memory-efficient than building a list of all records. It correctly parses the ‘SW’ tag to build a list of samples for each unique node.

Parameters

output_csv_pathPath: The path where the output CSV file will be saved.
core_threshold_percentfloat, optional: The proportion of total samples above which a node is considered ‘core’, expressed as a fraction (the default 0.95 corresponds to 95%). Defaults to 0.95.

Parameters:

output_csv_path (Path)
core_threshold_percent (float)

Return type:

None
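The single-pass aggregation idea can be sketched with the standard library alone. This example groups samples per node while reading the records once, then writes one row per node. It uses the csv module instead of Pandas, and the column names, the ‘dispensable’ label, and the inclusive threshold comparison are assumptions.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Set, TextIO, Tuple


def export_nodes(
    records: Iterable[Tuple[str, int, str]],
    total_samples: int,
    out: TextIO,
    core_threshold: float = 0.95,
) -> None:
    """Aggregate (node_id, length, sw_tag) records per node, then write CSV rows."""
    samples_by_node: Dict[str, Set[str]] = defaultdict(set)
    length_by_node: Dict[str, int] = {}
    for node_id, length, sw_tag in records:  # single pass over the input
        length_by_node[node_id] = length
        for entry in sw_tag.split(","):
            if entry:
                samples_by_node[node_id].add(entry.split(";")[0])
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["node_id", "length", "sample_count", "samples", "category"])
    for node_id in sorted(length_by_node):
        samples = sorted(samples_by_node[node_id])
        fraction = len(samples) / total_samples
        category = "core" if fraction >= core_threshold else "dispensable"
        writer.writerow([node_id, length_by_node[node_id], len(samples), ";".join(samples), category])


buffer = io.StringIO()
records = [("s1", 100, "A;chr1;0,B;chr1;0"), ("s2", 20, "A;chr1;0")]
export_nodes(records, total_samples=2, out=buffer)
csv_text = buffer.getvalue()
```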

Parameters:

bam_path (Path)
threads (int)
logger (Logger)
suffix (str | None)
works_path (Path | None)
gfa_name (str | None)
tagging (bool)
disable_progress_flag (bool)

class gratools.ThreadedRichHandler(*args, **kwargs)[source]

Bases: RichHandler

Custom RichHandler that processes log records in a separate thread to prevent blocking the main application thread, especially during I/O-bound logging operations (like writing to console or complex formatting).

It also includes a feature to trigger a program exit if a log record of ERROR level or higher is emitted through it.

Attributes

runningbool: A flag indicating whether the log processing worker thread should continue running.
log_queuequeue.Queue[logging.LogRecord]: A thread-safe queue used to buffer log records before they are processed by the worker thread.
worker_threadthreading.Thread: The background thread responsible for consuming log records from log_queue.
critical_error_occurredbool: A flag set to True if an ERROR or CRITICAL log has been processed, leading to program termination.

__init__(*args, **kwargs)[source]

Initializes the ThreadedRichHandler.

Sets up the log queue and starts the background worker thread.

Return type:: None

emit(record)[source]

Queues a log record for processing by the worker thread.

If the log record’s level is ERROR or higher, this method sets a flag to indicate a critical error and initiates the process to stop the log processing and exit the application.

Parameters

recordlogging.LogRecord: The log record to be emitted.

Parameters:: record (LogRecord)
Return type:: None

stop_processing()[source]

Signals the worker thread to stop and waits for it to terminate.

This ensures that the queue is flushed of pending log records before the thread exits.

Return type:: None

close()[source]

Closes the handler, ensuring the worker thread is stopped and resources are released.

Return type:: None
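The queue-and-worker design described for ThreadedRichHandler can be sketched with the standard logging module. This is not the GraTools class: it subclasses logging.Handler rather than RichHandler, forwards records to a hypothetical target handler, and only sets the error flag instead of exiting the program.

```python
import logging
import queue
import threading


class ThreadedHandler(logging.Handler):
    """Buffer records in a queue and let a worker thread deliver them."""

    def __init__(self, target: logging.Handler) -> None:
        super().__init__()
        self.target = target
        self.log_queue: "queue.Queue[logging.LogRecord]" = queue.Queue()
        self.running = True
        self.critical_error_occurred = False
        self.worker_thread = threading.Thread(target=self._worker, daemon=True)
        self.worker_thread.start()

    def _worker(self) -> None:
        # Keep draining until asked to stop AND the queue is empty (flush).
        while self.running or not self.log_queue.empty():
            try:
                record = self.log_queue.get(timeout=0.05)
            except queue.Empty:
                continue
            self.target.handle(record)

    def emit(self, record: logging.LogRecord) -> None:
        self.log_queue.put(record)
        if record.levelno >= logging.ERROR:
            # The handler described above would also stop and exit here.
            self.critical_error_occurred = True

    def stop_processing(self) -> None:
        self.running = False
        self.worker_thread.join()

    def close(self) -> None:
        self.stop_processing()
        super().close()


class ListHandler(logging.Handler):
    """Tiny sink that stores messages, used only to observe the result."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


sink = ListHandler()
handler = ThreadedHandler(sink)
log = logging.getLogger("threaded_sketch")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)
log.info("first")
log.error("second")
handler.stop_processing()
```

The standard library offers the same pattern through logging.handlers.QueueHandler and QueueListener, which may be worth comparing against when modifying this class.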

gratools.configure_logger(name, log_dir_path, verbosity_level, file_suffix='')[source]

Configures and returns a logger instance with specified settings.

The logger will output to:

1. The console (via RichHandler for formatted, colorful output).
2. A general log file (e.g., ‘name_log.o’).
3. An error log file for WARNING and higher messages (e.g., ‘name_log.e’).

Parameters

namestr: The name for the logger (e.g., “GraTools”).
log_dir_pathPath: The directory path where log files will be stored.
verbosity_levelstr: The logging verbosity level (e.g., “DEBUG”, “INFO”, “ERROR”). This sets the minimum level for messages to be processed by the logger.
file_suffixstr, optional: An optional suffix to append to the base name of log files. Defaults to an empty string.

Returns

logging.Logger: The configured logger instance.

Parameters:

name (str)
log_dir_path (Path)
verbosity_level (str)
file_suffix (str)

Return type:

Logger
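A three-destination logger of the kind described for configure_logger can be sketched as follows. The sketch uses a plain StreamHandler where GraTools uses RichHandler, and the message format and the position of file_suffix in the file names are assumptions.

```python
import logging
import sys
import tempfile
from pathlib import Path


def configure_logger_sketch(name: str, log_dir_path: Path, verbosity_level: str, file_suffix: str = "") -> logging.Logger:
    """Console handler, general log file, and a WARNING+ error log file."""
    log_dir_path.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, verbosity_level.upper()))
    logger.handlers.clear()
    logger.propagate = False
    formatter = logging.Formatter("%(levelname)s %(message)s")
    console = logging.StreamHandler(sys.stderr)
    general = logging.FileHandler(log_dir_path / f"{name}{file_suffix}_log.o")
    errors = logging.FileHandler(log_dir_path / f"{name}{file_suffix}_log.e")
    errors.setLevel(logging.WARNING)  # only WARNING and above reach this file
    for handler in (console, general, errors):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


with tempfile.TemporaryDirectory() as tmp:
    logger = configure_logger_sketch("demo", Path(tmp), "INFO")
    logger.info("parsing started")
    logger.warning("missing header")
    for handler in list(logger.handlers):
        handler.close()  # flush and release the files before reading them
        logger.removeHandler(handler)
    general_text = (Path(tmp) / "demo_log.o").read_text()
    error_text = (Path(tmp) / "demo_log.e").read_text()
```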

gratools.update_logger_file_suffix(logger, new_file_suffix)[source]

Updates file handlers of a logger to use a new file suffix. Existing log content from the old file (if any) is copied to the new file location before switching. The old file is then deleted.

This is useful if, for instance, a specific operation (like a query) should have its logs in a uniquely named file determined mid-execution.

Parameters

loggerlogging.Logger: The logger instance whose file handlers need updating.
new_file_suffixstr: The new suffix to be incorporated into the log filenames. Example: if the old file was “main_log.log” and new_file_suffix=“_query123”, the new file becomes “main_log_query123.log”.

Returns

logging.Logger: The same logger instance, now with updated file handlers.

Parameters:

logger (Logger)
new_file_suffix (str)

Return type:

Logger
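The renaming rule in the example above (“main_log.log” becoming “main_log_query123.log”) and the copy-then-delete step can be sketched with pathlib and shutil. The helper names are hypothetical, and the sketch leaves out re-pointing the logger's file handlers, which the real function also does.

```python
import shutil
import tempfile
from pathlib import Path


def with_file_suffix(path: Path, new_file_suffix: str) -> Path:
    """Insert the suffix between the file stem and its extension."""
    return path.with_name(f"{path.stem}{new_file_suffix}{path.suffix}")


def switch_log_file(old_path: Path, new_file_suffix: str) -> Path:
    """Copy existing log content to the new location, then delete the old file."""
    new_path = with_file_suffix(old_path, new_file_suffix)
    shutil.copyfile(old_path, new_path)
    old_path.unlink()
    return new_path


with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "main_log.log"
    old.write_text("INFO started\n")
    new = switch_log_file(old, "_query123")
    new_name = new.name
    carried_over = new.read_text()
    old_still_exists = old.exists()
```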

class gratools.CustomCommand(*args, **kwargs)[source]

Bases: Command

Custom Command class that applies global context settings (for help formatting) and prepends a specific header to the help message of individual commands.

__init__(*args, **kwargs)[source]

Initializes the CustomCommand.

Ensures that the predefined CONTEXT_SETTINGS are applied by default to this command.

invoke(ctx)[source]: Measures and displays the execution time around the actual call.

get_help(ctx)[source]

Overrides the default help generation to prepend a custom header.

The header is printed directly to the console using shared_console before the standard help text is generated and returned.

Parameters:: ctx (click.Context) – The current Click context.
Returns:: str – The formatted help text, including the prepended header.
Return type:: str

class gratools.CustomGroup(*args, **kwargs)[source]

Bases: Group

Custom Group class that applies global context settings, prepends a header to its own help message, and ensures that all subcommands added via its command decorator use CustomCommand by default.

__init__(*args, **kwargs)[source]

Initializes the CustomGroup.

Ensures that the predefined CONTEXT_SETTINGS are applied by default to this group and its subcommands (if they don’t override).

command(*args, **kwargs)[source]

Overrides the default command decorator registration.

This ensures that any command registered using this group’s command method will automatically use CustomCommand as its class, thereby inheriting the custom help formatting and header.

Parameters:

*args – Positional arguments for the command decorator.
**kwargs – Keyword arguments for the command decorator.

Returns:

Callable – The decorator that registers the command.

get_help(ctx)[source]

Overrides the default help generation for the group to prepend a custom header.

Parameters:: ctx (click.Context) – The current Click context.
Returns:: str – The formatted help text for the group, including the prepended header.
Return type:: str

gratools.validate_percentage_or_int(ctx, param, value)[source]

Click callback for validating that an option’s value is either an integer or a float representing a percentage (between 0.0 and 1.0 inclusive).

This function is intended to be used as a callback for a Click option.

Parameters:

ctx (click.Context) – The current Click context.
param (click.Parameter) – The Click parameter (option) being validated.
value (str | int | float) – The input value provided by the user for the option. Click might pass it as str, or already converted if type is set.

Returns:

int | float – The validated value, converted to int if it’s a whole number, or float if it’s a percentage (0.0-1.0).

Raises:

click.BadParameter – If the value is not a valid integer or a float between 0.0 and 1.0.

Return type:

int | float
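The validation rule can be sketched without Click. The function parse_percentage_or_int below is a hypothetical stand-in that raises ValueError; a real Click callback would take (ctx, param, value) and raise click.BadParameter instead. Treating any text that parses as an integer as a count, and everything else as a fraction in [0.0, 1.0], is an assumption about how the two cases are told apart.

```python
from typing import Union


def parse_percentage_or_int(value: Union[str, int, float]) -> Union[int, float]:
    """Return an int for whole numbers, or a float between 0.0 and 1.0."""
    text = str(value).strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        number = float(text)
    except ValueError:
        raise ValueError(f"{value!r} is not a number") from None
    if 0.0 <= number <= 1.0:
        return number
    raise ValueError(f"{value!r} must be an integer or a float between 0.0 and 1.0")


as_count = parse_percentage_or_int("5")
as_fraction = parse_percentage_or_int("0.25")
try:
    parse_percentage_or_int("2.5")
    rejected = False
except ValueError:
    rejected = True
```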

—

💡 Navigation Tip

Use the Table of Contents in the sidebar to jump quickly to a specific function or class within these modules. Each member is indexed and searchable.