GratoolsBam

class gratools.GratoolsBam(bam_path, threads=1, logger=<factory>, suffix=None, works_path=None, gfa_name=None, tagging=False)[source]

Bases: object

Handles operations related to BAM files in the GraTools context, such as indexing, extracting segment information, tagging segments, and performing various analyses (core/dispensable ratio, depth statistics, etc.).

Attributes

bam_path (Path): Path to the BAM file.
threads (int, optional): Number of threads for BAM operations (e.g., reading, indexing). Defaults to 1.
logger (logging.Logger): Logger instance. Defaults to a logger named “GraTools”.
suffix (Optional[str], optional): Suffix to append to output filenames generated by analyses. Defaults to None.
works_path (Optional[Path], optional): Working directory path for saving output files. Defaults to None (uses the BAM file's parent directory).
gfa_name (Optional[str], optional): Name of the associated GFA file (used for naming output files). Defaults to None.
tagging (bool, optional): If True, indicates that operations might modify tags, potentially requiring re-indexing. Used by index_bam to decide if indexing is needed. Defaults to False.
progress (Optional[Progress]): Rich Progress instance for displaying progress. Auto-initialized.

Attributes Summary

`gfa_name`
`suffix`
`tagging`
`threads`
`works_path`

Methods Summary

`build_segments`([list_segments])	Extracts specified segments from a BAM file and reconstructs their GFA S-line representation.
`core_dispensable_ratio`(nb_samples_gfa, ...)	Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in.
`depth_nodes_stat`(nb_samples_gfa[, ...])	Calculates and displays statistics about segment depth (number of unique samples a segment is found in).
`export_nodes_to_csv`(output_csv_path[, ...])	Exports information about each segment (node) in the BAM file to a CSV file.
`get_segments_and_positions_by_depth`(...[, ...])	Finds segments within a specific depth range and retrieves their genomic positions from BED files.
`get_specific_and_shared_segments`(samples_list_A)	Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.
`index_bam`()	Indexes the BAM file using pysam.index if the index is missing or outdated.
`tag`(dict_segments_samples)	Adds or updates the 'SW' (Sample Walks) tag on segments in the BAM file.

Attributes Documentation

gfa_name: str | None = None

suffix: str | None = None

tagging: bool = False

threads: int = 1

works_path: Path | None = None

Methods Documentation

build_segments(list_segments=None)[source]

Extracts specified segments from a BAM file and reconstructs their GFA S-line representation. Also populates dictionaries for segment samples and sequences.

Parameters

list_segments (Optional[List[str]], optional): A list of segment IDs (query_name in BAM) to extract. Behavior for None or an empty list is not guaranteed: the method may process all segments or return empty results. The current pysam.view call implies it needs a list, so pass one explicitly.

Returns

Tuple[List[str], defaultdict[str, List[str]], defaultdict[str, str]]

gfa_s_lines_list: List of strings, each a GFA S-line.
dict_seg_samples: defaultdict mapping segment ID to list of “sample;chrom;haplo” strings from SW tag.
dict_seg_sequence: defaultdict mapping segment ID to its sequence.

Parameters: list_segments (List[str] | None)
Return type: Tuple[List[str], defaultdict, defaultdict]
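The three return values can be pictured with a minimal sketch that does not touch a BAM file. It assumes the SW tag value is a comma-separated list of “sample;chrom;haplo” strings (as described for tag) and that the S-line carries it as an `SW:Z:` field; the helper names `parse_sw_tag` and `make_s_line` and the sample records are hypothetical, not part of GraTools.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional


def parse_sw_tag(sw_value: str) -> List[str]:
    """Split a comma-separated SW value into 'sample;chrom;haplo' entries."""
    return [entry for entry in sw_value.split(",") if entry]


def make_s_line(segment_id: str, sequence: str, sw_value: Optional[str] = None) -> str:
    """Build a tab-separated GFA S-line (appending SW:Z is an assumption)."""
    fields = ["S", segment_id, sequence]
    if sw_value:
        fields.append(f"SW:Z:{sw_value}")
    return "\t".join(fields)


# Hypothetical records standing in for BAM reads: (query_name, sequence, SW value)
records = [
    ("s1", "ACGT", "sampleA;chr1;0,sampleB;chr1;1"),
    ("s2", "GG", "sampleA;chr1;0"),
]

gfa_s_lines: List[str] = []
dict_seg_samples: DefaultDict[str, List[str]] = defaultdict(list)
dict_seg_sequence: DefaultDict[str, str] = defaultdict(str)

for name, seq, sw in records:
    gfa_s_lines.append(make_s_line(name, seq, sw))
    dict_seg_samples[name].extend(parse_sw_tag(sw))
    dict_seg_sequence[name] = seq
```

The exact optional fields GraTools writes on each S-line may differ; only the shape of the three containers is taken from the description above.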

core_dispensable_ratio(nb_samples_gfa, input_as_number, shared_min_cutoff=1, specific_max_cutoff=None, filter_min_len=1)[source]

Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in. Saves results to a CSV file.

Parameters

nb_samples_gfa (int): Total number of unique samples present in the GFA (used for percentage calculation).
input_as_number (bool): If True, shared_min_cutoff and specific_max_cutoff are treated as absolute counts of samples. If False, they are treated as percentages of nb_samples_gfa.
shared_min_cutoff (int, optional): Minimum number/percentage of samples a segment must be in to be considered “shared” (core). Defaults to 1.
specific_max_cutoff (Optional[int], optional): Maximum number/percentage of samples a segment can be in to be considered “specific” (dispensable). If None, specific analysis might be skipped or use a default (e.g., 1 if input_as_number). Defaults to None.
filter_min_len (int, optional): Minimum length (bp) for a segment to be included in the filtered analysis. Defaults to 1 (no length filter).

Parameters:

nb_samples_gfa (int)
input_as_number (bool)
shared_min_cutoff (int)
specific_max_cutoff (int | None)
filter_min_len (int)

Return type:

None
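How input_as_number changes the meaning of the two cutoffs can be illustrated with a small sketch. It assumes inclusive comparisons and that a None specific_max_cutoff simply disables the specific test; the function name `classify_segment` is hypothetical and this is not the GraTools implementation.

```python
from typing import Optional, Tuple


def classify_segment(
    depth: int,
    nb_samples_gfa: int,
    input_as_number: bool,
    shared_min_cutoff: int = 1,
    specific_max_cutoff: Optional[int] = None,
) -> Tuple[bool, bool]:
    """Return (is_shared, is_specific) for a segment found in `depth` samples."""
    # Express the depth in the same unit as the cutoffs: a count or a percentage.
    value = depth if input_as_number else 100.0 * depth / nb_samples_gfa
    is_shared = value >= shared_min_cutoff  # inclusive bound: an assumption
    is_specific = specific_max_cutoff is not None and value <= specific_max_cutoff
    return is_shared, is_specific


# Same segment (9 of 10 samples), cutoffs given as counts, then as percentages.
as_counts = classify_segment(9, 10, True, shared_min_cutoff=9, specific_max_cutoff=1)
as_percent = classify_segment(9, 10, False, shared_min_cutoff=90, specific_max_cutoff=10)
```

With the default shared_min_cutoff of 1, every segment present in at least one sample counts as shared under this sketch, so a higher cutoff is usually needed for a meaningful core set.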

depth_nodes_stat(nb_samples_gfa, filter_min_len=1)[source]

Calculates and displays statistics about segment depth (number of unique samples a segment is found in). Outputs results to console and a CSV file.

Parameters

nb_samples_gfa (int): Total number of unique samples in the GFA. Provided for context; it is not used directly in the calculations.
filter_min_len (int, optional): Minimum length (bp) for a segment to be included in the filtered depth analysis. Defaults to 1 (no effective length filter).

Parameters:

nb_samples_gfa (int)
filter_min_len (int)

Return type:

None
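Depth, as defined above, counts unique samples rather than walks. The sketch below shows that distinction on hand-written data, assuming each SW entry has the form “sample;chrom;haplo”; the helper names and the input layout (segment ID mapped to a length and its SW entries) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def segment_depth(sw_entries: List[str]) -> int:
    """Count unique sample names among 'sample;chrom;haplo' entries."""
    return len({entry.split(";")[0] for entry in sw_entries})


def depth_histogram(
    segments: Dict[str, Tuple[int, List[str]]], filter_min_len: int = 1
) -> Counter:
    """Map each depth to the number of segments of at least filter_min_len bp."""
    hist: Counter = Counter()
    for length, sw_entries in segments.values():
        if length >= filter_min_len:
            hist[segment_depth(sw_entries)] += 1
    return hist


# Hypothetical segments: id -> (length in bp, SW entries)
segments = {
    "s1": (120, ["A;chr1;0", "A;chr1;1", "B;chr1;0"]),  # two haplotypes of A, one of B
    "s2": (5, ["A;chr1;0"]),
    "s3": (300, ["B;chr2;0"]),
}
hist = depth_histogram(segments, filter_min_len=50)
```

Here s1 has depth 2 even though it appears in three walks, and s2 is dropped by the length filter.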

export_nodes_to_csv(output_csv_path, long_node_length_threshold=1000)[source]

Exports information about each segment (node) in the BAM file to a CSV file. Includes node name, length, sample IDs (from SW tag), inferred direction, and a flag if it’s a “long” node.

Parameters

output_csv_path (Path): Path where the output CSV file will be saved.
long_node_length_threshold (int, optional): Length threshold (bp) to classify a node as “long”. Defaults to 1000.

Parameters:

output_csv_path (Path)
long_node_length_threshold (int)

Return type:

None
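A rough picture of the exported rows, written with the standard csv module. The column names, the use of `>=` for the “long” flag, and the `write_nodes_csv` helper are assumptions made for illustration; the inferred direction column is left out because the source does not say how it is derived.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple


def write_nodes_csv(
    nodes: Iterable[Tuple[str, str, List[str]]],
    handle: TextIO,
    long_node_length_threshold: int = 1000,
) -> None:
    """Write one row per node: name, length, sample IDs, long-node flag."""
    writer = csv.writer(handle)
    writer.writerow(["node", "length", "samples", "is_long"])  # hypothetical header
    for name, sequence, sw_entries in nodes:
        samples = sorted({entry.split(";")[0] for entry in sw_entries})
        is_long = len(sequence) >= long_node_length_threshold  # assumed inclusive
        writer.writerow([name, len(sequence), ";".join(samples), is_long])


# Hypothetical nodes: (name, sequence, SW entries); written to memory, not disk.
buffer = io.StringIO()
write_nodes_csv(
    [("s1", "ACGT", ["A;chr1;0", "B;chr1;0"]), ("s2", "A" * 1000, ["A;chr1;0"])],
    buffer,
)
rows = buffer.getvalue().splitlines()
```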

get_segments_and_positions_by_depth(total_gfa_samples, input_as_number, lower_bound_depth, upper_bound_depth, filter_min_len, bed_path=None)[source]

Finds segments within a specific depth range and retrieves their genomic positions from BED files.

This function performs two main steps:

1. Scans the BAM file to identify segments that meet the specified depth and length criteria.
2. For those segments, it efficiently queries the relevant BED files to find their exact genomic coordinates (chromosome, start, end).

Parameters

total_gfa_samples (int): Total number of unique samples in the GFA, used for percentage calculations.
input_as_number (bool): If True, depth bounds are absolute counts; if False, they are percentages of total_gfa_samples.
lower_bound_depth (int): Minimum sample depth (count or percentage) for a segment to be included.
upper_bound_depth (int): Maximum sample depth (count or percentage) for a segment to be included.
filter_min_len (int): Minimum length in base pairs for a segment to be considered.
bed_path (Path, optional): Path to the directory containing the sample-specific BED files.

Returns

Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]

A tuple containing two dictionaries:

1. segments_with_depth: {segment_id: depth} for all segments matching the criteria.
2. segment_locations: {segment_id: {sample_name: [(chrom, start, end), …]}}

Parameters:

total_gfa_samples (int)
input_as_number (bool)
lower_bound_depth (int)
upper_bound_depth (int)
filter_min_len (int)
bed_path (Path)

Return type:

Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]
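The nested return type is easier to read with concrete values. The sketch below shows a depth-range test (assuming inclusive bounds) and one way to flatten segment_locations into BED-like rows; `in_depth_range`, `flatten_locations`, and the sample data are hypothetical and only mirror the documented structure.

```python
from typing import Dict, List, Tuple

Locations = Dict[str, Dict[str, List[Tuple[str, int, int]]]]


def in_depth_range(
    depth: int, total_gfa_samples: int, input_as_number: bool, lower: int, upper: int
) -> bool:
    """Check a depth against bounds given as counts or as percentages."""
    value = depth if input_as_number else 100.0 * depth / total_gfa_samples
    return lower <= value <= upper  # inclusive bounds: an assumption


def flatten_locations(
    segment_locations: Locations,
) -> List[Tuple[str, int, int, str, str]]:
    """Turn the nested mapping into sorted (chrom, start, end, segment, sample) rows."""
    rows = []
    for segment_id, per_sample in segment_locations.items():
        for sample, intervals in per_sample.items():
            for chrom, start, end in intervals:
                rows.append((chrom, start, end, segment_id, sample))
    return sorted(rows)


# Hypothetical values shaped like the documented return tuple.
segments_with_depth = {"s1": 2}
segment_locations = {"s1": {"A": [("chr1", 100, 220)], "B": [("chr1", 95, 215)]}}
rows = flatten_locations(segment_locations)
```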

get_specific_and_shared_segments(samples_list_A, samples_list_B=None, filter_min_len=None, output_csv=None)[source]

Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.

Parameters

samples_list_A (List[str]): A list of sample names. A segment is “shared” if it is present in ALL samples in this list.
samples_list_B (Optional[List[str]], optional): An optional second list of sample names. If provided, a segment is “specific” if it is shared by all in samples_list_A AND absent from ALL samples in this list.
filter_min_len (Optional[int], optional): If set, only segments with a length greater than or equal to this value will be considered.
output_csv (Optional[bool], optional): If True, the function will return sets of the shared and specific segment IDs.

Returns

Tuple[Set[str], Set[str]]

A set of segment IDs that are shared by all samples in samples_list_A.
A set of segment IDs that are specific to samples_list_A relative to samples_list_B. (This set is a subset of the first one).

Parameters:

samples_list_A (List[str])
samples_list_B (List[str] | None)
filter_min_len (int | None)
output_csv (bool | None)

Return type:

Tuple[Set[str], Set[str]]
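The shared and specific definitions reduce to two set tests, sketched here on an in-memory mapping of segment ID to the set of samples containing it. The function name and input layout are hypothetical, and treating a missing samples_list_B as an empty group (so that every shared segment is also specific) is an assumption, since the source does not state what happens in that case.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def shared_and_specific(
    seg_samples: Dict[str, Set[str]],
    samples_list_A: Iterable[str],
    samples_list_B: Optional[Iterable[str]] = None,
) -> Tuple[Set[str], Set[str]]:
    """Return (shared, specific) segment IDs for group A relative to group B."""
    group_a = set(samples_list_A)
    group_b = set(samples_list_B or [])
    # Shared: every sample of group A contains the segment.
    shared = {seg for seg, present in seg_samples.items() if group_a <= present}
    # Specific: shared, and no sample of group B contains the segment.
    specific = {seg for seg in shared if not (seg_samples[seg] & group_b)}
    return shared, specific


# Hypothetical presence map: segment -> samples in which it occurs.
seg_samples = {"s1": {"A", "B"}, "s2": {"A", "B", "C"}, "s3": {"A"}}
shared, specific = shared_and_specific(seg_samples, ["A", "B"], ["C"])
```

Because specific is filtered from shared, it is always a subset of it, matching the note in the Returns section.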

index_bam()[source]

Indexes the BAM file using pysam.index if the index is missing or outdated. The index file will have a ‘.bai’ extension for BAM input or a ‘.crai’ extension for CRAM input.

Return type: None
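One plausible reading of “missing or outdated” is a modification-time comparison between the BAM file and its index. The sketch below implements only that check on empty placeholder files; the helper name `index_is_stale`, the `<bam>.bai` naming, and the mtime rule are assumptions, and the actual pysam.index call is deliberately left out.

```python
import os
import tempfile
from pathlib import Path


def index_is_stale(bam_path: Path, index_suffix: str = ".bai") -> bool:
    """Return True when the index is absent or older than the BAM file."""
    index_path = Path(str(bam_path) + index_suffix)
    if not index_path.exists():
        return True
    return index_path.stat().st_mtime < bam_path.stat().st_mtime


# Demonstration on empty placeholder files with explicitly set timestamps.
with tempfile.TemporaryDirectory() as tmp:
    bam = Path(tmp) / "graph.bam"
    bam.write_bytes(b"")
    missing = index_is_stale(bam)  # no index written yet

    index = Path(str(bam) + ".bai")
    index.write_bytes(b"")
    os.utime(bam, (1_000, 1_000))
    os.utime(index, (2_000, 2_000))
    fresh = index_is_stale(bam)  # index newer than BAM

    os.utime(index, (500, 500))
    outdated = index_is_stale(bam)  # index older than BAM
```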

tag(dict_segments_samples)[source]

Adds or updates the ‘SW’ (Sample Walks) tag on segments in the BAM file. The SW tag stores a comma-separated list of “sample;chromosome;haplotype” strings indicating which walks/paths contain the segment. The original BAM file is overwritten with the tagged version.

Parameters

dict_segments_samples (Dict[str, List[str]]): Dictionary mapping segment IDs (query_name) to a list of “sample;chromosome;haplotype” strings.

Returns

Path: The path to the (now tagged and re-indexed) BAM file.

Raises

FileNotFoundError: If the input BAM file does not exist.
Exception: If errors occur during BAM reading, writing, or renaming.

Parameters: dict_segments_samples (Dict[str, List[str]])
Return type: Path
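The expected shape of dict_segments_samples, and the string each segment would carry in its SW tag, can be sketched without a BAM file. The validation step and the helper name `build_sw_value` are hypothetical additions for illustration; whether GraTools validates or deduplicates entries is not stated in the source.

```python
from typing import Dict, List


def build_sw_value(entries: List[str]) -> str:
    """Join 'sample;chromosome;haplotype' entries into one comma-separated value."""
    for entry in entries:
        if len(entry.split(";")) != 3:
            raise ValueError(f"expected 'sample;chromosome;haplotype', got {entry!r}")
    return ",".join(entries)


# Hypothetical input in the documented shape: segment ID -> walk descriptors.
dict_segments_samples: Dict[str, List[str]] = {
    "s1": ["A;chr1;0", "B;chr1;1"],
    "s2": ["A;chr1;0"],
}
sw_values = {seg: build_sw_value(entries) for seg, entries in dict_segments_samples.items()}
```

With pysam, a string like this would typically be attached to a read through `AlignedSegment.set_tag("SW", value, value_type="Z")`; how tag() does it internally is not shown here.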

Parameters:

bam_path (Path)
threads (int)
logger (Logger)
suffix (str | None)
works_path (Path | None)
gfa_name (str | None)
tagging (bool)