GratoolsBam
- class gratools.GratoolsBam(bam_path, threads=1, logger=<factory>, suffix=None, works_path=None, gfa_name=None, tagging=False)[source]
Bases: object
Handles operations related to BAM files in the GraTools context, such as indexing, extracting segment information, tagging segments, and performing various analyses (core/dispensable ratio, depth statistics, etc.).
Attributes
- bam_pathPath
Path to the BAM file.
- threadsint, optional
Number of threads for BAM operations (e.g., reading, indexing). Defaults to 1.
- loggerlogging.Logger
Logger instance. Defaults to a logger named “GraTools”.
- suffixOptional[str], optional
Suffix to append to output filenames generated by analyses. Defaults to None.
- works_pathOptional[Path], optional
Working directory path for saving output files. Defaults to None (uses BAM parent dir).
- gfa_nameOptional[str], optional
Name of the associated GFA file (used for naming output files). Defaults to None.
- taggingbool, optional
If True, indicates that operations might modify tags, potentially requiring re-indexing. Used by index_bam to decide if indexing is needed. Defaults to False.
- progressOptional[Progress]
Rich Progress instance for displaying progress. Auto-initialized.
Methods Summary
build_segments([list_segments])Extracts specified segments from a BAM file and reconstructs their GFA S-line representation.
core_dispensable_ratio(nb_samples_gfa, ...)Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in.
depth_nodes_stat(nb_samples_gfa[, ...])Calculates and displays statistics about segment depth (number of unique samples a segment is found in).
export_nodes_to_csv(output_csv_path[, ...])Exports information about each segment (node) in the BAM file to a CSV file.
get_segments_and_positions_by_depth(...[, ...])Finds segments within a specific depth range and retrieves their genomic positions from BED files.
get_specific_and_shared_segments(samples_list_A)Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.
index_bam()Indexes the BAM file using pysam.index if the index is missing or outdated.
tag(dict_segments_samples)Adds or updates the 'SW' (Sample Walks) tag to segments in the BAM file.
Methods Documentation
- build_segments(list_segments=None)[source]
Extracts specified segments from a BAM file and reconstructs their GFA S-line representation. Also populates dictionaries for segment samples and sequences.
Parameters
- list_segmentsOptional[List[str]], optional
A list of segment IDs (query_name in BAM) to extract. If None or empty, the behavior is not guaranteed: the method may process all segments or return empty results (the current pysam.view call expects a list).
Returns
- Tuple[List[str], defaultdict[str, List[str]], defaultdict[str, str]]
gfa_s_lines_list: List of strings, each a GFA S-line.
dict_seg_samples: defaultdict mapping segment ID to list of “sample;chrom;haplo” strings from SW tag.
dict_seg_sequence: defaultdict mapping segment ID to its sequence.
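To make the shape of the three return values concrete, here is a minimal sketch that does not use GraTools or pysam. It assumes an S-line has the form `S<TAB>id<TAB>sequence` with an optional trailing `SW:Z:` field; the helper `make_s_line`, the sample records, and the placement of the SW tag on the S-line are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import List


def make_s_line(segment_id: str, sequence: str, walks: List[str]) -> str:
    """Build a GFA S-line; the trailing SW:Z: field is an assumed layout."""
    fields = ["S", segment_id, sequence]
    if walks:
        fields.append("SW:Z:" + ",".join(walks))
    return "\t".join(fields)


# Hypothetical records: (segment ID, sequence, "sample;chrom;haplo" entries)
records = [
    ("s1", "ACGT", ["sampleA;chr1;0", "sampleB;chr1;1"]),
    ("s2", "GG", []),
]

gfa_s_lines_list = []
dict_seg_samples = defaultdict(list)
dict_seg_sequence = defaultdict(str)
for seg_id, seq, walks in records:
    gfa_s_lines_list.append(make_s_line(seg_id, seq, walks))
    dict_seg_samples[seg_id].extend(walks)
    dict_seg_sequence[seg_id] = seq
```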
- core_dispensable_ratio(nb_samples_gfa, input_as_number, shared_min_cutoff=1, specific_max_cutoff=None, filter_min_len=1)[source]
Analyzes segments in the BAM file to determine core (shared) and dispensable (specific) ratios based on the number of samples a segment is found in. Saves results to a CSV file.
Parameters
- nb_samples_gfaint
Total number of unique samples present in the GFA (used for percentage calculation).
- input_as_numberbool
If True, shared_min_cutoff and specific_max_cutoff are treated as absolute counts of samples. If False, they are treated as percentages of nb_samples_gfa.
- shared_min_cutoffint, optional
Minimum number/percentage of samples a segment must be in to be considered “shared” (core). Defaults to 1.
- specific_max_cutoffOptional[int], optional
Maximum number/percentage of samples a segment can be in to be considered “specific” (dispensable). If None, specific analysis might be skipped or use a default (e.g., 1 if input_as_number). Defaults to None.
- filter_min_lenint, optional
Minimum length (bp) for a segment to be included in the filtered analysis. Defaults to 1 (no length filter).
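The interaction between input_as_number and the two cutoffs can be sketched as follows. This is an illustration under stated assumptions, not the GraTools implementation: percentages are assumed to be converted to counts by rounding up, both bounds are assumed inclusive, and the helpers `to_count` and `classify` are hypothetical names.

```python
import math
from typing import Optional, Tuple


def to_count(cutoff: float, nb_samples_gfa: int, input_as_number: bool) -> int:
    """Convert a cutoff to an absolute sample count.

    Assumption: a percentage is rounded up to the next whole sample.
    """
    if input_as_number:
        return int(cutoff)
    return math.ceil(nb_samples_gfa * cutoff / 100)


def classify(
    depth: int,
    nb_samples_gfa: int,
    input_as_number: bool,
    shared_min_cutoff: float = 1,
    specific_max_cutoff: Optional[float] = None,
) -> Tuple[bool, bool]:
    """Return (is_shared, is_specific) for a segment found in `depth` samples."""
    shared_min = to_count(shared_min_cutoff, nb_samples_gfa, input_as_number)
    is_shared = depth >= shared_min
    is_specific = False
    if specific_max_cutoff is not None:
        specific_max = to_count(specific_max_cutoff, nb_samples_gfa, input_as_number)
        is_specific = depth <= specific_max
    return is_shared, is_specific
```

With 10 samples, a shared cutoff of 90 percent and a specific cutoff of 20 percent correspond to counts of 9 and 2 in this sketch.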
- depth_nodes_stat(nb_samples_gfa, filter_min_len=1)[source]
Calculates and displays statistics about segment depth (number of unique samples a segment is found in). Outputs results to console and a CSV file.
Parameters
- nb_samples_gfaint
Total number of unique samples in the GFA. Provided for context; it is not used directly in the calculations.
- filter_min_lenint, optional
Minimum length (bp) for a segment to be included in the filtered depth analysis. Defaults to 1 (no effective length filter).
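Segment depth, as defined above, is the number of unique samples a segment occurs in. A minimal sketch of that computation, assuming each SW entry is a "sample;chrom;haplo" string and that the sample name is the first field (the helper `segment_depth` and the example data are hypothetical):

```python
from collections import Counter
from typing import List


def segment_depth(sw_entries: List[str]) -> int:
    """Count the unique sample names among "sample;chrom;haplo" strings."""
    return len({entry.split(";")[0] for entry in sw_entries})


# Hypothetical SW entries per segment
segments = {
    "s1": ["A;chr1;0", "A;chr1;1", "B;chr1;0"],  # two haplotypes of A count once
    "s2": ["A;chr2;0"],
    "s3": ["B;chr1;0", "C;chr1;0"],
}

depths = {seg: segment_depth(entries) for seg, entries in segments.items()}
# Number of segments observed at each depth
histogram = Counter(depths.values())
```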
- export_nodes_to_csv(output_csv_path, long_node_length_threshold=1000)[source]
Exports information about each segment (node) in the BAM file to a CSV file. Includes node name, length, sample IDs (from SW tag), inferred direction, and a flag if it’s a “long” node.
Parameters
- output_csv_pathPath
Path where the output CSV file will be saved.
- long_node_length_thresholdint, optional
Length threshold (bp) to classify a node as “long”. Defaults to 1000.
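The "long" flag can be illustrated with the standard csv module. This sketch makes several assumptions: the column names are hypothetical, a node is treated as long when its length is greater than or equal to the threshold, and the inferred direction column is omitted because the source does not describe how it is derived.

```python
import csv
import io


def write_nodes_csv(handle, nodes, long_node_length_threshold=1000):
    """Write one row per node: name, length, sample IDs, long flag."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["node", "length", "samples", "is_long"])
    for name, sequence, samples in nodes:
        length = len(sequence)
        # Assumption: ">=" marks a node as long
        is_long = length >= long_node_length_threshold
        writer.writerow([name, length, "|".join(sorted(samples)), is_long])


# Hypothetical nodes: (name, sequence, set of sample IDs)
nodes = [
    ("s1", "A" * 1200, {"A", "B"}),
    ("s2", "ACG", {"A"}),
]
buffer = io.StringIO()
write_nodes_csv(buffer, nodes)
rows = buffer.getvalue().splitlines()
```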
- get_segments_and_positions_by_depth(total_gfa_samples, input_as_number, lower_bound_depth, upper_bound_depth, filter_min_len, bed_path=None)[source]
Finds segments within a specific depth range and retrieves their genomic positions from BED files.
This function performs two main steps:
1. Scans the BAM file to identify segments that meet the specified depth and length criteria.
2. For those segments, it efficiently queries the relevant BED files to find their exact genomic coordinates (chromosome, start, end).
Parameters
- total_gfa_samplesint
Total number of unique samples in the GFA, used for percentage calculations.
- input_as_numberbool
If True, depth bounds are absolute counts; if False, they are percentages of total_gfa_samples.
- lower_bound_depthint
Minimum sample depth (count or percentage) for a segment to be included.
- upper_bound_depthint
Maximum sample depth (count or percentage) for a segment to be included.
- filter_min_lenint
Minimum length in base pairs for a segment to be considered.
- bed_pathPath, optional
Path to the directory containing the sample-specific BED files.
Returns
- Tuple[Dict[str, int], Dict[str, Dict[str, List[Tuple[str, int, int]]]]]
A tuple containing two dictionaries:
1. segments_with_depth: {segment_id: depth} for all segments matching the criteria.
2. segment_locations: {segment_id: {sample_name: [(chrom, start, end), …]}}
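The two steps and the two returned dictionaries can be sketched on in-memory data. This is an illustration only: it assumes inclusive depth bounds, that each sample has BED lines whose fourth column holds the segment ID, and the helpers `select_by_depth` and `locate` are hypothetical names.

```python
from collections import defaultdict


def select_by_depth(depths, lengths, lower, upper, min_len):
    """Step 1: keep segments within [lower, upper] depth and >= min_len bp."""
    return {
        seg: depth
        for seg, depth in depths.items()
        if lower <= depth <= upper and lengths[seg] >= min_len
    }


def locate(selected, bed_lines_by_sample):
    """Step 2: collect (chrom, start, end) per sample for selected segments."""
    locations = defaultdict(lambda: defaultdict(list))
    for sample, lines in bed_lines_by_sample.items():
        for line in lines:
            chrom, start, end, name = line.split("\t")[:4]
            if name in selected:
                locations[name][sample].append((chrom, int(start), int(end)))
    return locations


# Hypothetical inputs
depths = {"s1": 2, "s2": 1, "s3": 3}
lengths = {"s1": 50, "s2": 500, "s3": 5}
bed_lines_by_sample = {
    "A": ["chr1\t0\t50\ts1\t0\t+", "chr1\t50\t550\ts2\t0\t+"],
    "B": ["chr1\t10\t60\ts1\t0\t-"],
}

segments_with_depth = select_by_depth(depths, lengths, lower=2, upper=3, min_len=10)
segment_locations = locate(segments_with_depth, bed_lines_by_sample)
```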
- get_specific_and_shared_segments(samples_list_A, ...)
Identifies and counts segments that are shared among one group of samples and specific to that group relative to a second group.
Parameters
- samples_list_AList[str]
A list of sample names. A segment is “shared” if it is present in ALL samples in this list.
- samples_list_BOptional[List[str]], optional
An optional second list of sample names. If provided, a segment is “specific” if it is shared by all in samples_list_A AND absent from ALL samples in this list.
- filter_min_lenOptional[int], optional
If set, only segments with a length greater than or equal to this value will be considered.
- output_csvOptional[bool], optional
If True, the function will return sets of the shared and specific segment IDs.
Returns
- Tuple[Set[str], Set[str]]
A set of segment IDs that are shared by all samples in samples_list_A.
A set of segment IDs that are specific to samples_list_A relative to samples_list_B. (This set is a subset of the first one).
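The shared and specific definitions reduce to set operations. A minimal sketch, assuming a mapping from segment ID to the set of samples containing it; the helper name is hypothetical, and returning an empty specific set when samples_list_B is not given is an assumption of this sketch, not documented behavior.

```python
from typing import Dict, List, Optional, Set, Tuple


def shared_and_specific(
    seg_samples: Dict[str, Set[str]],
    samples_list_A: List[str],
    samples_list_B: Optional[List[str]] = None,
) -> Tuple[Set[str], Set[str]]:
    group_a = set(samples_list_A)
    # Shared: present in ALL samples of group A
    shared = {seg for seg, samples in seg_samples.items() if group_a <= samples}
    if samples_list_B is None:
        return shared, set()  # assumption: no group B means nothing is "specific"
    group_b = set(samples_list_B)
    # Specific: shared by group A and absent from ALL samples of group B
    specific = {seg for seg in shared if not (seg_samples[seg] & group_b)}
    return shared, specific


# Hypothetical segment-to-samples mapping
seg_samples = {"s1": {"A", "B"}, "s2": {"A", "B", "C"}, "s3": {"A"}}
shared, specific = shared_and_specific(seg_samples, ["A", "B"], ["C"])
```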
- index_bam()[source]
Indexes the BAM file using pysam.index if the index is missing or outdated. The index file will have a ‘.bai’ (BAM) or ‘.crai’ (CRAM) extension depending on the file format.
- Return type:
None
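One common way to decide whether an index is "missing or outdated" is to compare modification times. The sketch below shows that check with the standard library only; it is an assumption about how staleness could be detected, not a description of what index_bam does internally, and `index_is_stale` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def index_is_stale(bam_path: Path, index_path: Path) -> bool:
    """True if the index is missing or older than the BAM file."""
    if not index_path.exists():
        return True
    return index_path.stat().st_mtime < bam_path.stat().st_mtime


with tempfile.TemporaryDirectory() as tmp:
    bam = Path(tmp) / "graph.bam"
    bai = Path(tmp) / "graph.bam.bai"
    bam.write_bytes(b"")
    missing = index_is_stale(bam, bai)   # no index file yet

    bai.write_bytes(b"")
    os.utime(bam, (1_000, 1_000))
    os.utime(bai, (2_000, 2_000))
    fresh = index_is_stale(bam, bai)     # index newer than the BAM

    os.utime(bam, (3_000, 3_000))
    outdated = index_is_stale(bam, bai)  # BAM modified after indexing
```

When the check returns True, the index would be rebuilt, for example with pysam.index on the BAM path.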
- tag(dict_segments_samples)[source]
Adds or updates the ‘SW’ (Sample Walks) tag to segments in the BAM file. The SW tag stores a comma-separated list of “sample;chromosome;haplotype” strings indicating which walks/paths contain the segment. The original BAM file is overwritten with the tagged version.
Parameters
- dict_segments_samplesDict[str, List[str]]
Dictionary mapping segment IDs (query_name) to a list of “sample;chromosome;haplotype” strings.
Returns
- Path
The path to the (now tagged and re-indexed) BAM file.
Raises
- FileNotFoundError
If the input BAM file does not exist.
- Exception
If errors occur during BAM reading, writing, or renaming.
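The SW value format described above (comma-separated "sample;chromosome;haplotype" strings) can be encoded and decoded as in this minimal sketch. The helpers `encode_sw` and `decode_sw` are hypothetical; with pysam, the encoded string could then be attached to a read with `AlignedSegment.set_tag("SW", value, value_type="Z")`.

```python
from typing import List, Tuple


def encode_sw(walks: List[str]) -> str:
    """Join "sample;chromosome;haplotype" strings into one SW tag value."""
    return ",".join(walks)


def decode_sw(value: str) -> List[Tuple[str, ...]]:
    """Split an SW tag value back into (sample, chromosome, haplotype) tuples."""
    return [tuple(item.split(";")) for item in value.split(",") if item]


# Hypothetical input in the shape of dict_segments_samples
dict_segments_samples = {"s1": ["A;chr1;0", "B;chr2;1"], "s2": []}
sw_values = {seg: encode_sw(walks) for seg, walks in dict_segments_samples.items()}
```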