GratoolsBam

class gratools.GratoolsBam(bam_path, threads=1, logger=<factory>, suffix=None, works_path=None, gfa_name=None, tagging=False)[source]

Bases: object

Handles operations related to BAM files in the GraTools context, such as indexing, extracting segment information, tagging segments, and performing various analyses (core/dispensable ratio, depth statistics, etc.).

Attributes

bam_path : Path

Path to the BAM file.

threads : int, optional

Number of threads for BAM operations (e.g., reading, indexing). Defaults to 1.

logger : logging.Logger

Logger instance. Defaults to a logger named “GraTools”.

suffix : Optional[str], optional

Suffix to append to output filenames generated by analyses. Defaults to None.

works_path : Optional[Path], optional

Working directory path for saving output files. Defaults to None, in which case the BAM file’s parent directory is used.

gfa_name : Optional[str], optional

Name of the associated GFA file (used for naming output files). Defaults to None.

tagging : bool, optional

If True, indicates that operations might modify tags, potentially requiring re-indexing. Used by index_bam to decide if indexing is needed. Defaults to False.

progress : Optional[Progress]

Rich Progress instance for displaying progress. Auto-initialized.

Attributes Summary

gfa_name

suffix

tagging

threads

works_path

Methods Summary

build_segments([list_segments])

Extracts specified segments from a BAM file and reconstructs their GFA S-line representation.

core_dispensable_ratio(nb_samples_gfa, ...)

Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in.

depth_nodes_stat(nb_samples_gfa[, ...])

Calculates and displays statistics about segment depth (number of unique samples a segment is found in).

export_nodes_to_csv(output_csv_path[, ...])

Exports information about each segment (node) in the BAM file to a CSV file.

get_segments_and_positions_by_depth(...[, ...])

Finds segments within a specific depth range and retrieves their genomic positions from BED files.

get_specific_and_shared_segments(samples_list_A)

Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.

index_bam()

Indexes the BAM file using pysam.index if the index is missing or outdated.

tag(dict_segments_samples)

Adds or updates the 'SW' (Sample Walks) tag on segments in the BAM file.

Attributes Documentation

gfa_name: str | None = None
suffix: str | None = None
tagging: bool = False
threads: int = 1
works_path: Path | None = None

Methods Documentation

build_segments(list_segments=None)[source]

Extracts specified segments from a BAM file and reconstructs their GFA S-line representation. Also populates dictionaries for segment samples and sequences.

Parameters

list_segments : Optional[List[str]], optional

A list of segment IDs (query_name in BAM) to extract. Behavior for None or an empty list is not specified; the underlying pysam.view call appears to expect a list, so pass one explicitly.

Returns

Tuple[List[str], defaultdict[str, List[str]], defaultdict[str, str]]
  • gfa_s_lines_list: List of strings, each a GFA S-line.

  • dict_seg_samples: defaultdict mapping segment ID to list of “sample;chrom;haplo” strings from SW tag.

  • dict_seg_sequence: defaultdict mapping segment ID to its sequence.

Parameters:

list_segments (List[str] | None)

Return type:

Tuple[List[str], defaultdict, defaultdict]
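The three return values can be illustrated with a minimal, standalone sketch. It does not read a BAM file: the input triples of (segment ID, sequence, SW tag value) and the helper name build_s_lines are hypothetical, and the S-lines carry only the mandatory GFA fields, since the optional tags GraTools may emit are not documented here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_s_lines(
    records: Iterable[Tuple[str, str, str]],
) -> Tuple[List[str], Dict[str, List[str]], Dict[str, str]]:
    """Hypothetical helper: records are (segment_id, sequence, sw_value) triples."""
    gfa_s_lines: List[str] = []
    seg_samples: Dict[str, List[str]] = defaultdict(list)
    seg_sequence: Dict[str, str] = defaultdict(str)
    for seg_id, sequence, sw_value in records:
        # A GFA S-line is tab-separated: record type, segment name, sequence.
        gfa_s_lines.append(f"S\t{seg_id}\t{sequence}")
        seg_sequence[seg_id] = sequence
        # SW is documented as a comma-separated list of "sample;chrom;haplo".
        seg_samples[seg_id].extend(w for w in sw_value.split(",") if w)
    return gfa_s_lines, seg_samples, seg_sequence
```

Calling build_s_lines([("s1", "ACGT", "A;chr1;0,B;chr1;1")]) yields one S-line plus the two dictionaries keyed by "s1".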

core_dispensable_ratio(nb_samples_gfa, input_as_number, shared_min_cutoff=1, specific_max_cutoff=None, filter_min_len=1)[source]

Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in. Saves results to a CSV file.

Parameters

nb_samples_gfa : int

Total number of unique samples present in the GFA (used for percentage calculation).

input_as_number : bool

If True, shared_min_cutoff and specific_max_cutoff are treated as absolute counts of samples. If False, they are treated as percentages of nb_samples_gfa.

shared_min_cutoff : int, optional

Minimum number/percentage of samples a segment must be in to be considered “shared” (core). Defaults to 1.

specific_max_cutoff : Optional[int], optional

Maximum number/percentage of samples a segment can be in to be considered “specific” (dispensable). If None, the specific analysis may be skipped or fall back to a default (e.g., 1 when input_as_number is True). Defaults to None.

filter_min_len : int, optional

Minimum length (bp) for a segment to be included in the filtered analysis. Defaults to 1 (no length filter).

Parameters:
  • nb_samples_gfa (int)

  • input_as_number (bool)

  • shared_min_cutoff (int)

  • specific_max_cutoff (int | None)

  • filter_min_len (int)

Return type:

None
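The interplay between input_as_number and the two cutoffs can be sketched as follows. This is an illustration, not the GraTools implementation: resolve_cutoff and classify are hypothetical names, rounding percentages up to a whole sample count is an assumption, and so is checking the “specific” rule before the “shared” rule.

```python
import math
from typing import Optional


def resolve_cutoff(value: int, nb_samples: int, as_number: bool) -> int:
    """Return the cutoff as an absolute number of samples."""
    if as_number:
        return value
    # Assumption: a percentage is rounded up to a whole number of samples.
    return math.ceil(nb_samples * value / 100)


def classify(
    depth: int,
    nb_samples: int,
    as_number: bool,
    shared_min_cutoff: int = 1,
    specific_max_cutoff: Optional[int] = None,
) -> str:
    """Label a segment by the number of samples (depth) it occurs in."""
    if specific_max_cutoff is not None:
        specific_cut = resolve_cutoff(specific_max_cutoff, nb_samples, as_number)
        # Assumption: the "specific" test takes priority over "shared".
        if depth <= specific_cut:
            return "specific"
    if depth >= resolve_cutoff(shared_min_cutoff, nb_samples, as_number):
        return "shared"
    return "other"
```

With 10 samples, a shared cutoff of 90 percent resolves to 9 samples, so a segment found in all 10 is labeled "shared".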

depth_nodes_stat(nb_samples_gfa, filter_min_len=1)[source]

Calculates and displays statistics about segment depth (number of unique samples a segment is found in). Outputs results to console and a CSV file.

Parameters

nb_samples_gfa : int

Total number of unique samples in the GFA. Provided for context; not used directly in the calculations here.

filter_min_len : int, optional

Minimum length (bp) for a segment to be included in the filtered depth analysis. Defaults to 1 (no effective length filter).

Parameters:
  • nb_samples_gfa (int)

  • filter_min_len (int)

Return type:

None
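Depth, as defined above, is the number of unique samples a segment is found in. A minimal sketch of that computation, assuming each walk string has the documented “sample;chrom;haplo” layout; the function names and the in-memory input dictionary are hypothetical stand-ins for data read from the BAM file.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def segment_depth(walks: Iterable[str]) -> int:
    """Count unique sample names, i.e. the first ';'-separated field."""
    return len({walk.split(";", 1)[0] for walk in walks})


def depth_histogram(
    segments: Dict[str, Tuple[int, List[str]]],
    filter_min_len: int = 1,
) -> Counter:
    """Map depth -> number of segments, skipping segments below filter_min_len.

    segments maps segment ID -> (length in bp, list of walk strings).
    """
    histogram: Counter = Counter()
    for _seg_id, (length, walks) in segments.items():
        if length >= filter_min_len:
            histogram[segment_depth(walks)] += 1
    return histogram
```

Two haplotypes of the same sample count once, so a segment with walks "A;chr1;0", "A;chr1;1" and "B;chr1;0" has depth 2.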

export_nodes_to_csv(output_csv_path, long_node_length_threshold=1000)[source]

Exports information about each segment (node) in the BAM file to a CSV file. Includes node name, length, sample IDs (from SW tag), inferred direction, and a flag if it’s a “long” node.

Parameters

output_csv_path : Path

Path where the output CSV file will be saved.

long_node_length_threshold : int, optional

Length threshold (bp) to classify a node as “long”. Defaults to 1000.

Parameters:
  • output_csv_path (Path)

  • long_node_length_threshold (int)

Return type:

None
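A rough sketch of such an export using the standard csv module. The column names, the write_nodes_csv helper, and the use of >= for the “long” flag are assumptions; the inferred-direction column is omitted because its derivation is not described here.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple


def write_nodes_csv(
    handle: TextIO,
    nodes: Iterable[Tuple[str, int, List[str]]],
    long_node_length_threshold: int = 1000,
) -> None:
    """Write one row per node: name, length, sample IDs, long-node flag.

    nodes yields (name, length in bp, list of "sample;chrom;haplo" strings).
    """
    writer = csv.writer(handle)
    # Hypothetical header; the real column names may differ.
    writer.writerow(["node", "length", "samples", "is_long"])
    for name, length, walks in nodes:
        samples = sorted({walk.split(";", 1)[0] for walk in walks})
        # Assumption: a node is "long" when length >= threshold.
        is_long = length >= long_node_length_threshold
        writer.writerow([name, length, ";".join(samples), is_long])


# Usage: write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_nodes_csv(buffer, [("s1", 1500, ["B;chr1;0", "A;chr1;0"])])
```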

get_segments_and_positions_by_depth(total_gfa_samples, input_as_number, lower_bound_depth, upper_bound_depth, filter_min_len, bed_path=None)[source]

Finds segments within a specific depth range and retrieves their genomic positions from BED files.

This function performs two main steps:
  1. Scans the BAM file to identify segments that meet the specified depth and length criteria.

  2. For those segments, it efficiently queries the relevant BED files to find their exact genomic coordinates (chromosome, start, end).

Parameters

total_gfa_samples : int

Total number of unique samples in the GFA, used for percentage calculations.

input_as_number : bool

If True, depth bounds are absolute counts; if False, they are percentages of total_gfa_samples.

lower_bound_depth : int

Minimum sample depth (count or percentage) for a segment to be included.

upper_bound_depth : int

Maximum sample depth (count or percentage) for a segment to be included.

filter_min_len : int

Minimum length in base pairs for a segment to be considered.

bed_path : Path, optional

Path to the directory containing the sample-specific BED files.

Returns

Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]

A tuple containing two dictionaries:
  1. segments_with_depth: {segment_id: depth} for all segments matching the criteria.

  2. segment_locations: {segment_id: {sample_name: [(chrom, start, end), …]}}

Parameters:
  • total_gfa_samples (int)

  • input_as_number (bool)

  • lower_bound_depth (int)

  • upper_bound_depth (int)

  • filter_min_len (int)

  • bed_path (Path)

Return type:

Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]
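The two steps and the shape of the returned dictionaries can be sketched without any file access. The sketch assumes depth bounds already expressed as absolute counts and BED lines with at least four tab-separated columns whose fourth (name) column holds the segment ID; that layout, like the function names, is an assumption and not confirmed by this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Position = Tuple[str, int, int]


def select_by_depth(depths: Dict[str, int], lower: int, upper: int) -> Dict[str, int]:
    """Step 1: keep segments whose depth lies in [lower, upper]."""
    return {seg: d for seg, d in depths.items() if lower <= d <= upper}


def locate_segments(
    bed_by_sample: Dict[str, Iterable[str]],
    wanted: Set[str],
) -> Dict[str, Dict[str, List[Position]]]:
    """Step 2: collect (chrom, start, end) per segment and sample."""
    locations: Dict[str, Dict[str, List[Position]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for sample, lines in bed_by_sample.items():
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comments
            # Assumption: column 4 of the BED line is the segment ID.
            chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
            if name in wanted:
                locations[name][sample].append((chrom, int(start), int(end)))
    return locations
```

Passing set(select_by_depth(...)) as wanted restricts the BED scan to the segments found in step 1.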

get_specific_and_shared_segments(samples_list_A, samples_list_B=None, filter_min_len=None, output_csv=None)[source]

Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.

Parameters

samples_list_A : List[str]

A list of sample names. A segment is “shared” if it is present in ALL samples in this list.

samples_list_B : Optional[List[str]], optional

An optional second list of sample names. If provided, a segment is “specific” if it is shared by all in samples_list_A AND absent from ALL samples in this list.

filter_min_len : Optional[int], optional

If set, only segments with a length greater than or equal to this value will be considered.

output_csv : Optional[bool], optional

If True, the function will return sets of the shared and specific segment IDs.

Returns

Tuple[Set[str], Set[str]]
  • A set of segment IDs that are shared by all samples in samples_list_A.

  • A set of segment IDs that are specific to samples_list_A relative to samples_list_B. (This set is a subset of the first one).

Parameters:
  • samples_list_A (List[str])

  • samples_list_B (List[str] | None)

  • filter_min_len (int | None)

  • output_csv (bool | None)

Return type:

Tuple[Set[str], Set[str]]
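The “shared” and “specific” definitions reduce to set operations. A minimal sketch over an in-memory mapping of segment ID to the set of samples containing it; the mapping and the function name are hypothetical, and returning an empty “specific” set when no second group is given is an assumption.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def shared_and_specific(
    seg_samples: Dict[str, Set[str]],
    samples_a: Iterable[str],
    samples_b: Optional[Iterable[str]] = None,
) -> Tuple[Set[str], Set[str]]:
    """Return (shared, specific) segment ID sets."""
    group_a = set(samples_a)
    group_b = set(samples_b or [])
    # Shared: every sample of group A contains the segment.
    shared = {seg for seg, present in seg_samples.items() if group_a <= present}
    if not group_b:
        # Assumption: without a second group there is nothing to be specific to.
        return shared, set()
    # Specific: shared by group A and absent from every sample of group B.
    specific = {seg for seg in shared if not (seg_samples[seg] & group_b)}
    return shared, specific
```

By construction the second set is always a subset of the first, matching the description of the return value.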

index_bam()[source]

Indexes the BAM file using pysam.index if the index is missing or outdated. The index file has a ‘.bai’ extension for BAM input or a ‘.crai’ extension for CRAM input.

Return type:

None
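One plausible reading of “missing or outdated” is a modification-time comparison between the BAM file and its index. The sketch below uses only pathlib; the function name, the two candidate index locations, and the mtime rule are assumptions, and the actual check in index_bam may differ. When the check returns True, a caller would run pysam.index on the BAM path.

```python
from pathlib import Path


def index_is_stale(bam_path: Path) -> bool:
    """Return True if no .bai index exists or every index predates the BAM."""
    # Assumption: the index sits next to the BAM as x.bam.bai or x.bai.
    candidates = [Path(str(bam_path) + ".bai"), bam_path.with_suffix(".bai")]
    existing = [path for path in candidates if path.exists()]
    if not existing:
        return True  # index missing
    bam_mtime = bam_path.stat().st_mtime
    # Outdated: the BAM was modified after every index found.
    return all(path.stat().st_mtime < bam_mtime for path in existing)
```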

tag(dict_segments_samples)[source]

Adds or updates the ‘SW’ (Sample Walks) tag on segments in the BAM file. The SW tag stores a comma-separated list of “sample;chromosome;haplotype” strings indicating which walks/paths contain the segment. The original BAM file is overwritten with the tagged version.

Parameters

dict_segments_samples : Dict[str, List[str]]

Dictionary mapping segment IDs (query_name) to a list of “sample;chromosome;haplotype” strings.

Returns

Path

The path to the (now tagged and re-indexed) BAM file.

Raises

FileNotFoundError

If the input BAM file does not exist.

Exception

If errors occur during BAM reading, writing, or renaming.

Parameters:

dict_segments_samples (Dict[str, List[str]])

Return type:

Path
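The SW value format described above can be sketched as a pair of encode/decode helpers. Both function names are hypothetical, and the validation rule (exactly three ‘;’-separated fields, no comma inside an entry) is an assumption drawn from the documented layout. With pysam, the encoded string could then be attached to a read via AlignedSegment.set_tag("SW", value, value_type="Z"); whether tag() does exactly this is not confirmed here.

```python
from typing import List, Tuple


def encode_sw(walks: List[str]) -> str:
    """Join "sample;chromosome;haplotype" strings into one SW tag value."""
    for walk in walks:
        # Assumption: each entry has exactly three ';'-separated fields
        # and contains no comma, since ',' separates entries.
        if walk.count(";") != 2 or "," in walk:
            raise ValueError(f"malformed walk entry: {walk!r}")
    return ",".join(walks)


def decode_sw(value: str) -> List[Tuple[str, ...]]:
    """Split an SW tag value back into (sample, chromosome, haplotype) tuples."""
    return [tuple(item.split(";")) for item in value.split(",") if item]
```

A round trip through encode_sw and decode_sw returns the same three fields per walk, which is a convenient sanity check before overwriting a BAM file.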

Parameters: