SubGraph
- class gratools.SubGraph(bam_path, bed_path, logger=<factory>, sample_name=None, sample_name_query=None, chromosome_query=None, start_query=None, stop_query=None, offset_first=0, offset_last=0, add_start_bases_first_segment=0, intersect_bed=None, segment_id_set=<factory>, segment_id_first_query=None, segment_id_first_strand=None, segment_id_last_query=None, segment_id_last_strand=None, segment_id_first=None, segment_id_last=None, works_path=None, merge=None, build_fasta_flag=False, gfa_walk_list=<factory>, gfa_link_list=<factory>, gfa_segment_list=<factory>, dict_segments_samples=<factory>, dict_segments_sequence=<factory>, sequences_list=<factory>, progress_dict=None, task_id=None, regions=None, intersected_results_by_regions=None)
Bases: object
A class representing a subgraph in a genomic analysis pipeline. It handles operations such as BED file intersection, GFA walk/link/segment building, and FASTA sequence generation for a specific sample or region.
Attributes
- bam_path (Path)
The file path to the BAM file.
- bed_path (Path)
The file path to the BED file.
- logger (logging.Logger)
The logger instance for logging messages. Defaults to a logger named “GraTools”.
- sample_name (Optional[str])
The name of the sample, derived from bed_path. Defaults to None.
- sample_name_query (Optional[str])
The name of the sample being queried. Defaults to None.
- chromosome_query (Optional[str])
The chromosome name for the query. Defaults to None.
- start_query (Optional[int])
The start position on the chromosome for the query. Defaults to None.
- stop_query (Optional[int])
The stop position on the chromosome for the query. Defaults to None.
- offset_first (int)
The offset for the first segment in the query region. Defaults to 0.
- offset_last (int)
The offset for the last segment in the query region. Defaults to 0.
- add_start_bases_first_segment (int)
Additional start bases for the first segment. Defaults to 0.
- intersect_bed (Optional[BedTool])
The BedTool object for intersected BED regions. Defaults to None.
- segment_id_set (Set[str])
A set storing unique segment IDs encountered during walk building. Defaults to an empty set.
- segment_id_first_query (Optional[str])
The ID of the first segment in the query region. Defaults to None.
- segment_id_first_strand (Optional[str])
The strand (‘+’ or ‘-’) of the first segment in the query. Defaults to None.
- segment_id_last_query (Optional[str])
The ID of the last segment in the query region. Defaults to None.
- segment_id_last_strand (Optional[str])
The strand (‘+’ or ‘-’) of the last segment in the query. Defaults to None.
- segment_id_first (Optional[str])
The first segment ID encountered for the current sample (may be the same as the query segment). Defaults to None.
- segment_id_last (Optional[str])
The last segment ID encountered for the current sample (may be the same as the query segment). Defaults to None.
- works_path (Optional[Path])
The working directory path. Defaults to None.
- merge (Optional[int])
Merge distance parameter for BED operations (e.g., bedtools merge -d). Defaults to None.
- build_fasta_flag (Optional[bool])
Flag indicating whether FASTA sequences should be built. Defaults to False.
- gfa_walk_list (List[str])
A list of GFA walk strings (W lines). Defaults to an empty list.
- gfa_link_list (List[str])
A list of GFA link strings (L lines). Defaults to an empty list.
- gfa_segment_list (List[str])
A list of GFA segment strings (S lines). Defaults to an empty list.
- dict_segments_samples (defaultdict[str, List[str]])
A dictionary mapping segment IDs to a list of sample identifiers. Defaults to an empty defaultdict.
- dict_segments_sequence (defaultdict[str, str])
A dictionary mapping segment IDs to their sequences. Defaults to an empty defaultdict.
- sequences_list (List[SeqRecord])
A list of Biopython SeqRecord objects for generated FASTA sequences. Defaults to an empty list.
- progress_dict (Optional[Dict])
A dictionary to track progress, typically for multiprocessing. Defaults to None.
- task_id (Optional[TaskID])
The task ID for progress tracking with rich.progress. Defaults to None.
- regions (Optional[List[Dict[str, Any]]])
A list of regions (dictionaries with the keys ‘chromosome’, ‘start’, ‘stop’). Defaults to None.
- intersected_results_by_regions (Optional[BedTool])
A BedTool object containing combined intersected results for all regions. Defaults to None.
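The `<factory>` defaults in the signature suggest that SubGraph is a dataclass whose mutable fields are created per instance. The sketch below illustrates that pattern with a hypothetical stand-in class, `MiniSubGraph`, which mirrors only a handful of the attributes listed above. It is not the gratools implementation.

```python
import logging
from collections import defaultdict
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional, Set


@dataclass
class MiniSubGraph:
    """Hypothetical stand-in mirroring a few SubGraph fields (illustration only)."""

    bam_path: Path
    bed_path: Path
    # Fields rendered as <factory> in the signature are built once per instance.
    logger: logging.Logger = field(
        default_factory=lambda: logging.getLogger("GraTools")
    )
    sample_name: Optional[str] = None
    offset_first: int = 0
    offset_last: int = 0
    segment_id_set: Set[str] = field(default_factory=set)
    gfa_walk_list: List[str] = field(default_factory=list)
    dict_segments_sequence: defaultdict = field(
        default_factory=lambda: defaultdict(str)
    )


a = MiniSubGraph(Path("sample.bam"), Path("sample.bed"))
b = MiniSubGraph(Path("other.bam"), Path("other.bed"))
a.segment_id_set.add("s1")

# Each instance owns its own set, so b is unaffected by the change to a.
print(a.segment_id_set, b.segment_id_set)  # {'s1'} set()
```

Using `default_factory` avoids the classic pitfall of a single list or set being shared by every instance.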
Methods Summary
- build_fasta()
Build FASTA sequences from the GFA walks (self.gfa_walk_list).
- build_segments()
Build GFA segments (S lines) from the BAM file for segments in self.segment_id_set.
- build_walks()
Build GFA walks (W lines) and links (L lines) from the intersected BED regions.
- compute_intersection()
Compute the intersection of BED regions for each region in self.regions.
- filter_bed_with_awk()
Filter a BED file using an awk command line to extract only lines containing an ID of interest.
- get_chr_pos(progress_dict, task_id)
Identify chromosomal regions corresponding to segments in self.segment_id_set, then compute intersections, and finally build walks and segments for these regions.
Methods Documentation
- build_fasta()
Build FASTA sequences from the GFA walks (self.gfa_walk_list).
This method reconstructs FASTA sequences by processing each walk, extracting its constituent segments, and applying offsets when the current sample is the query sample (self.sample_name == self.sample_name_query). The offsets trim the segment sequences at the beginning and end of the walk so that the result matches the exact query region boundaries (self.start_query, self.stop_query). The resulting sequences are stored as SeqRecord objects in self.sequences_list.
- Return type:
None
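A minimal sketch of the walk-to-sequence step described above. It assumes that a walk is an ordered list of (segment ID, strand) pairs, that '-' segments are reverse-complemented, and that the offsets count bases trimmed from each end of the concatenated walk. The function names and the exact offset semantics are assumptions for illustration, not the gratools implementation.

```python
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def walk_to_sequence(
    walk: List[Tuple[str, str]],
    segment_sequences: Dict[str, str],
    offset_first: int = 0,
    offset_last: int = 0,
) -> str:
    """Concatenate the segments of a walk, then trim both ends.

    Assumption: offset_first bases are dropped from the start and
    offset_last bases from the end of the concatenated sequence.
    """
    parts = []
    for segment_id, strand in walk:
        seq = segment_sequences[segment_id]
        parts.append(seq if strand == "+" else reverse_complement(seq))
    full = "".join(parts)
    return full[offset_first : len(full) - offset_last]


sequences = {"s1": "ACGTAC", "s2": "GGA", "s3": "TTTT"}
walk = [("s1", "+"), ("s2", "-"), ("s3", "+")]

# Query sample: trim 2 bases at the start and 1 base at the end.
print(walk_to_sequence(walk, sequences, offset_first=2, offset_last=1))
# GTACTCCTTT

# Non-query sample: offsets stay at 0, so the full walk is returned.
print(walk_to_sequence(walk, sequences))
# ACGTACTCCTTTT
```

In the class itself the result is wrapped in a Biopython SeqRecord and appended to self.sequences_list; plain strings are used here to keep the sketch dependency-free.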
- build_segments()
Build GFA segments (S lines) from the BAM file for segments in self.segment_id_set.
- Return type:
None
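For orientation, a GFA S line is a tab-separated record of the form `S`, segment name, sequence. The sketch below formats such lines from an in-memory dictionary. The function name `make_segment_lines`, the dictionary lookup (standing in for the BAM access that the real method performs) and the sorted output order are all assumptions for illustration.

```python
from typing import Dict, Iterable, List


def make_segment_lines(
    segment_ids: Iterable[str], sequences: Dict[str, str]
) -> List[str]:
    """Format one GFA S line per segment ID.

    Assumption: sequences are looked up in a plain dict here, whereas
    the documented method reads them from the BAM file.
    """
    lines = []
    for segment_id in sorted(segment_ids):  # sorted only for stable output
        lines.append(f"S\t{segment_id}\t{sequences[segment_id]}")
    return lines


segment_lines = make_segment_lines({"s2", "s1"}, {"s1": "ACGT", "s2": "GG"})
for line in segment_lines:
    print(line)
```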
- build_walks()
Build GFA walks (W lines) and links (L lines) from the intersected BED regions.
- Return type:
None
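The following sketch shows what W and L lines look like for a single ordered path of oriented segments. The field layout follows the GFA 1.1 format (W: sample, haplotype index, sequence ID, start, end, walk; L: from, orientation, to, orientation, overlap). The helper names, the `0M` overlap and the example values are assumptions for illustration, not taken from gratools.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (segment_id, strand) with strand '+' or '-'


def make_walk_line(
    sample: str, hap_index: int, seq_id: str, start: int, end: int,
    steps: List[Step],
) -> str:
    """Build a GFA W line; '>' marks forward and '<' reverse segments."""
    walk = "".join((">" if strand == "+" else "<") + seg for seg, strand in steps)
    return "\t".join(
        ["W", sample, str(hap_index), seq_id, str(start), str(end), walk]
    )


def make_link_lines(steps: List[Step]) -> List[str]:
    """Build one GFA L line per distinct pair of consecutive steps."""
    links = []
    seen = set()
    for (seg_a, strand_a), (seg_b, strand_b) in zip(steps, steps[1:]):
        key = (seg_a, strand_a, seg_b, strand_b)
        if key in seen:
            continue  # avoid emitting the same link twice
        seen.add(key)
        # Assumption: segments abut without overlap, hence "0M".
        links.append("\t".join(["L", seg_a, strand_a, seg_b, strand_b, "0M"]))
    return links


steps = [("s1", "+"), ("s2", "-"), ("s3", "+")]
walk_line = make_walk_line("sampleA", 0, "chr1", 100, 250, steps)
link_lines = make_link_lines(steps)
print(walk_line)
print("\n".join(link_lines))
```

Every segment ID seen while building the walk would also be recorded (the class keeps them in self.segment_id_set) so that the matching S lines can be produced afterwards.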
- compute_intersection()
Compute the intersection of BED regions for each region in self.regions. Store the results in self.intersected_results_by_regions.
- Return type:
None
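The idea of intersecting BED records with a list of regions can be sketched in pure Python. This example assumes half-open BED intervals and region dictionaries with the keys 'chromosome', 'start' and 'stop', as described for self.regions. The actual method works on BedTool objects, so the function below is only an illustration of the overlap logic.

```python
from typing import Any, Dict, List, Tuple

BedRecord = Tuple[str, int, int, str]  # (chrom, start, end, name)


def intersect_regions(
    records: List[BedRecord], regions: List[Dict[str, Any]]
) -> List[BedRecord]:
    """Collect, region by region, the BED records overlapping that region.

    Assumption: intervals are half-open [start, end), so a record that
    merely touches a region boundary does not count as overlapping.
    """
    hits = []
    for region in regions:
        for chrom, start, end, name in records:
            same_chrom = chrom == region["chromosome"]
            overlaps = start < region["stop"] and end > region["start"]
            if same_chrom and overlaps:
                hits.append((chrom, start, end, name))
    return hits


records = [
    ("chr1", 0, 100, "s1"),
    ("chr1", 100, 200, "s2"),
    ("chr2", 0, 50, "s3"),
]
regions = [{"chromosome": "chr1", "start": 150, "stop": 300}]
hits = intersect_regions(records, regions)
print(hits)  # [('chr1', 100, 200, 's2')]
```

With this sketch, a record that overlaps two regions appears once per region in the combined result.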
- filter_bed_with_awk()
Filter a BED file using an awk command line to extract only lines containing an ID of interest.
- Return type:
BedTool
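As a rough illustration of awk-based filtering, the sketch below builds an awk argument list and provides an equivalent pure-Python filter. It assumes that the ID of interest sits in the fourth (name) column of the BED file and must match exactly. The column choice, the exact-match rule and both function names are assumptions; the documentation does not specify the actual awk program.

```python
from typing import Iterable, List


def awk_filter_command(bed_path: str, segment_id: str, column: int = 4) -> List[str]:
    """Return an argv list for awk that keeps lines whose column equals the ID.

    Passing the ID through -v avoids splicing it into the awk program text.
    """
    program = f"${column} == id"
    return ["awk", "-F", "\t", "-v", f"id={segment_id}", program, bed_path]


def filter_bed_lines(
    lines: Iterable[str], segment_id: str, column: int = 4
) -> List[str]:
    """Pure-Python equivalent of the awk program above."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column and fields[column - 1] == segment_id:
            kept.append(line)
    return kept


bed_lines = [
    "chr1\t0\t100\ts1\n",
    "chr1\t100\t200\ts12\n",
    "chr1\t200\t300\ts1\n",
]
command = awk_filter_command("sample.bed", "s1")
kept = filter_bed_lines(bed_lines, "s1")
print(command)
print(kept)
```

The command list can be handed to `subprocess.run(command, capture_output=True, text=True)`. Matching on a whole column rather than a substring keeps `s1` from also selecting `s12`.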
- get_chr_pos(progress_dict, task_id)
Identify chromosomal regions corresponding to segments in self.segment_id_set, then compute intersections, and finally build walks and segments for these regions.
Parameters
- progress_dict (Optional[Dict])
A dictionary to track progress for multiprocessing.
- task_id (Optional[TaskID])
The task ID for rich.progress tracking.
- Return type:
None
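A common pattern for this kind of progress reporting is that each worker writes its state into a shared dictionary under its own task ID, and the parent process reads that dictionary to refresh a progress bar. The sketch below illustrates the pattern with a plain dict. The entry layout (`"progress"` and `"total"` keys), the helper names and the step labels are assumptions, not the gratools implementation.

```python
from typing import Dict, List, Optional


def report_progress(
    progress_dict: Optional[Dict], task_id: Optional[int],
    completed: int, total: int,
) -> None:
    """Record progress for one task; do nothing when tracking is disabled."""
    if progress_dict is None or task_id is None:
        return
    progress_dict[task_id] = {"progress": completed, "total": total}


def run_steps(
    progress_dict: Optional[Dict] = None, task_id: Optional[int] = None
) -> List[str]:
    """Run the documented stages in order, reporting after each one."""
    steps = [
        "identify regions",
        "compute_intersection",
        "build_walks",
        "build_segments",
    ]
    done = []
    for index, step in enumerate(steps, start=1):
        done.append(step)  # placeholder for the real work of this stage
        report_progress(progress_dict, task_id, index, len(steps))
    return done


shared = {}  # in a real run this could be a multiprocessing.Manager().dict()
finished = run_steps(shared, task_id=7)
print(finished)
print(shared)
```

In the parent process, the stored values would typically be forwarded to the progress bar, for example with `progress.update(task_id, completed=..., total=...)` on a `rich.progress.Progress` instance.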
- Parameters (SubGraph constructor):
bam_path (Path)
bed_path (Path)
logger (Logger)
sample_name (str | None)
sample_name_query (str | None)
chromosome_query (str | None)
start_query (int | None)
stop_query (int | None)
offset_first (int)
offset_last (int)
add_start_bases_first_segment (int)
intersect_bed (BedTool | None)
segment_id_first_query (str | None)
segment_id_first_strand (str | None)
segment_id_last_query (str | None)
segment_id_last_strand (str | None)
segment_id_first (str | None)
segment_id_last (str | None)
works_path (Path | None)
merge (int | None)
build_fasta_flag (bool | None)
dict_segments_samples (defaultdict)
dict_segments_sequence (defaultdict)
sequences_list (List[SeqRecord])
progress_dict (Dict | None)
task_id (TaskID | None)
intersected_results_by_regions (BedTool | None)