SubGraph

class gratools.SubGraph(bam_path, bed_path, logger=<factory>, sample_name=None, sample_name_query=None, chromosome_query=None, start_query=None, stop_query=None, offset_first=0, offset_last=0, add_start_bases_first_segment=0, intersect_bed=None, segment_id_set=<factory>, segment_id_first_query=None, segment_id_first_strand=None, segment_id_last_query=None, segment_id_last_strand=None, segment_id_first=None, segment_id_last=None, works_path=None, merge=None, build_fasta_flag=False, gfa_walk_list=<factory>, gfa_link_list=<factory>, gfa_segment_list=<factory>, dict_segments_samples=<factory>, dict_segments_sequence=<factory>, sequences_list=<factory>, progress_dict=None, task_id=None, regions=None, intersected_results_by_regions=None)

Bases: object

A class representing a subgraph in a genomic analysis pipeline. It handles operations like BED file intersection, GFA walk/link/segment building, and FASTA sequence generation for a specific sample or region.

Attributes

bam_path : Path

The file path to the BAM file.

bed_path : Path

The file path to the BED file.

logger : logging.Logger

The logger instance for logging messages. Defaults to a logger named “GraTools”.

sample_name : Optional[str]

The name of the sample, derived from bed_path. Defaults to None.

sample_name_query : Optional[str]

The name of the sample being queried. Defaults to None.

chromosome_query : Optional[str]

The chromosome name for the query. Defaults to None.

start_query : Optional[int]

The start position on the chromosome for the query. Defaults to None.

stop_query : Optional[int]

The stop position on the chromosome for the query. Defaults to None.

offset_first : int

The offset for the first segment in the query region. Defaults to 0.

offset_last : int

The offset for the last segment in the query region. Defaults to 0.

add_start_bases_first_segment : int

Number of additional start bases for the first segment. Defaults to 0.

intersect_bed : Optional[BedTool]

The BedTool object for intersected BED regions. Defaults to None.

segment_id_set : Set[str]

A set storing unique segment IDs encountered during walk building.

segment_id_first_query : Optional[str]

The ID of the first segment in the query region. Defaults to None.

segment_id_first_strand : Optional[str]

The strand (‘+’ or ‘-’) of the first segment in the query. Defaults to None.

segment_id_last_query : Optional[str]

The ID of the last segment in the query region. Defaults to None.

segment_id_last_strand : Optional[str]

The strand (‘+’ or ‘-’) of the last segment in the query. Defaults to None.

segment_id_first : Optional[str]

The first segment ID encountered for the current sample (may be the same as the query segment). Defaults to None.

segment_id_last : Optional[str]

The last segment ID encountered for the current sample (may be the same as the query segment). Defaults to None.

works_path : Optional[Path]

The working directory path. Defaults to None.

merge : Optional[int]

Merge distance parameter for BED operations (e.g., bedtools merge -d). Defaults to None.

build_fasta_flag : Optional[bool]

Flag indicating whether FASTA sequences should be built. Defaults to False.

gfa_walk_list : List[str]

A list of GFA walk strings (W lines). Defaults to an empty list.

gfa_link_list : List[str]

A list of GFA link strings (L lines). Defaults to an empty list.

gfa_segment_list : List[str]

A list of GFA segment strings (S lines). Defaults to an empty list.

dict_segments_samples : defaultdict[str, List[str]]

A dictionary mapping segment IDs to a list of sample identifiers. Defaults to an empty defaultdict.

dict_segments_sequence : defaultdict[str, str]

A dictionary mapping segment IDs to their sequences. Defaults to an empty defaultdict.

sequences_list : List[SeqRecord]

A list of Biopython SeqRecord objects for generated FASTA sequences. Defaults to an empty list.

progress_dict : Optional[Dict]

A dictionary to track progress, typically for multiprocessing. Defaults to None.

task_id : Optional[TaskID]

The task ID for progress tracking with rich.progress. Defaults to None.

regions : Optional[List[Dict[str, Any]]]

A list of regions (dictionaries with the keys ‘chromosome’, ‘start’, and ‘stop’). Defaults to None.

intersected_results_by_regions : Optional[BedTool]

A BedTool object containing combined intersected results for all regions. Defaults to None.
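
The merge attribute is described above as a distance in the style of bedtools merge -d. The following is a minimal sketch of that kind of distance-based merging in plain Python, not the code SubGraph runs. The function name merge_intervals, the max_gap argument, and the tuple layout are illustrative assumptions, and intervals are assumed to be BED-style (0-based, half-open).

```python
from typing import List, Tuple

# (chromosome, start, stop) with BED-style half-open coordinates (assumed layout)
Interval = Tuple[str, int, int]


def merge_intervals(intervals: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge intervals on the same chromosome that lie at most max_gap bases apart."""
    merged: List[Interval] = []
    for chrom, start, stop in sorted(intervals):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            # Close enough to the previous interval: extend it.
            prev_chrom, prev_start, prev_stop = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_stop, stop))
        else:
            merged.append((chrom, start, stop))
    return merged


example = [("chr1", 100, 200), ("chr1", 250, 300), ("chr1", 1000, 1100)]
print(merge_intervals(example, max_gap=50))
# [('chr1', 100, 300), ('chr1', 1000, 1100)]
```

With max_gap=0 only overlapping or directly adjacent intervals are joined, so the three example intervals stay separate.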

Attributes Summary

add_start_bases_first_segment

build_fasta_flag

chromosome_query

intersect_bed

intersected_results_by_regions

merge

offset_first

offset_last

progress_dict

regions

sample_name

sample_name_query

segment_id_first

segment_id_first_query

segment_id_first_strand

segment_id_last

segment_id_last_query

segment_id_last_strand

start_query

stop_query

task_id

works_path

Methods Summary

build_fasta()

Build FASTA sequences from the GFA walks (self.gfa_walk_list).

build_segments()

Build GFA segments (S lines) from the BAM file for segments in self.segment_id_set.

build_walks()

Build GFA walks (W lines) and links (L lines) from the intersected BED regions.

compute_intersection()

Compute the intersection of BED regions for each region in self.regions.

filter_bed_with_awk()

Filter a BED file using an awk command line to extract only lines containing an ID of interest.

get_chr_pos(progress_dict, task_id)

Identify chromosomal regions corresponding to segments in self.segment_id_set, then compute intersections, and finally build walks and segments for these regions.

Attributes Documentation

add_start_bases_first_segment: int = 0
build_fasta_flag: bool | None = False
chromosome_query: str | None = None
intersect_bed: BedTool | None = None
intersected_results_by_regions: BedTool | None = None
merge: int | None = None
offset_first: int = 0
offset_last: int = 0
progress_dict: Dict | None = None
regions: List[Dict[str, Any]] | None = None
sample_name: str | None = None
sample_name_query: str | None = None
segment_id_first: str | None = None
segment_id_first_query: str | None = None
segment_id_first_strand: str | None = None
segment_id_last: str | None = None
segment_id_last_query: str | None = None
segment_id_last_strand: str | None = None
start_query: int | None = None
stop_query: int | None = None
task_id: TaskID | None = None
works_path: Path | None = None

Methods Documentation

build_fasta()

Build FASTA sequences from the GFA walks (self.gfa_walk_list).

This method recovers FASTA sequences by processing each walk, extracting the constitutive segments, and applying specific offsets if the current sample is the query sample (self.sample_name == self.sample_name_query). The offsets trim segment sequences at the beginning or end of the walk to match the precise query region boundaries (self.start_query, self.stop_query). The resulting sequences are stored as SeqRecord objects in self.sequences_list.

Return type:

None
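
The idea behind build_fasta() can be sketched as follows: concatenate the segments of a walk, reverse-complementing segments traversed backwards, then trim the ends by the offsets. This is a standalone illustration under stated assumptions, not the gratools implementation. The helper names are hypothetical, the walk is assumed to use the GFA ">id" / "<id" step notation, and the offsets are assumed to be base counts removed from the start and the end of the concatenated sequence. The real method stores Biopython SeqRecord objects; a plain string is returned here to keep the sketch dependency-free.

```python
import re
from typing import Dict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def walk_to_sequence(
    walk: str,
    segments: Dict[str, str],
    offset_first: int = 0,
    offset_last: int = 0,
) -> str:
    """Spell out a walk and trim it (hypothetical helper, assumed semantics)."""
    parts = []
    # Each step is an orientation character followed by a segment ID.
    for orient, seg_id in re.findall(r"([<>])([^<>]+)", walk):
        seq = segments[seg_id]
        parts.append(seq if orient == ">" else reverse_complement(seq))
    full = "".join(parts)
    # Drop offset_first bases from the start and offset_last bases from the end.
    return full[offset_first:len(full) - offset_last]


segments = {"s1": "ACGT", "s2": "GG", "s3": "AAC"}
print(walk_to_sequence(">s1>s2<s3", segments))                                 # ACGTGGGTT
print(walk_to_sequence(">s1>s2<s3", segments, offset_first=1, offset_last=2))  # CGTGGG
```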

build_segments()

Build GFA segments (S lines) from the BAM file for segments in self.segment_id_set.

Return type:

None
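
For orientation, a GFA S line is a tab-separated record of the form S, segment ID, sequence. The sketch below only shows how such lines could be formatted from a set of IDs and an ID-to-sequence mapping. How SubGraph reads the sequences from the BAM file is not documented here, so that step is left out, and the function name format_segment_lines is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def format_segment_lines(segment_ids: Iterable[str], sequences: Dict[str, str]) -> List[str]:
    """Format one GFA S line per segment ID (hypothetical helper)."""
    lines = []
    for seg_id in sorted(segment_ids):
        # GFA allows "*" as a placeholder when the sequence is not stored.
        seq = sequences.get(seg_id, "*")
        lines.append(f"S\t{seg_id}\t{seq}")
    return lines


for line in format_segment_lines({"s2", "s1"}, {"s1": "ACGT"}):
    print(line)
```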

build_walks()

Build GFA walks (W lines) and links (L lines) from the intersected BED regions.

Return type:

None
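
As an illustration of the two record types this method produces, the sketch below formats a GFA W line and the matching L lines from an ordered list of oriented segments. It is not the gratools implementation: the function names and the input layout are assumptions, and the "0M" overlap on each link is a placeholder chosen for the example.

```python
from typing import List, Tuple

# (segment_id, strand) with strand "+" or "-" (assumed input layout)
Step = Tuple[str, str]


def format_walk_line(
    sample: str, hap_index: int, seq_id: str, start: int, stop: int, steps: List[Step]
) -> str:
    """Format a GFA W line: W, sample, haplotype index, sequence ID, start, end, walk."""
    walk = "".join((">" if strand == "+" else "<") + seg_id for seg_id, strand in steps)
    return "\t".join(["W", sample, str(hap_index), seq_id, str(start), str(stop), walk])


def format_link_lines(steps: List[Step]) -> List[str]:
    """Format one GFA L line per pair of consecutive steps in the walk."""
    return [
        f"L\t{from_id}\t{from_strand}\t{to_id}\t{to_strand}\t0M"
        for (from_id, from_strand), (to_id, to_strand) in zip(steps, steps[1:])
    ]


steps = [("s1", "+"), ("s2", "-"), ("s3", "+")]
print(format_walk_line("sampleA", 0, "chr1", 0, 9, steps))
for link in format_link_lines(steps):
    print(link)
```

A walk with n steps yields n - 1 links, so a single-segment walk produces no L lines.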

compute_intersection()

Compute the intersection of BED regions for each region in self.regions. Store the results in self.intersected_results_by_regions.

Return type:

None
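
The per-region intersection can be pictured with a small pure-Python stand-in. The real method works with BedTool objects; the sketch below only assumes the region dictionaries use the ‘chromosome’, ‘start’, and ‘stop’ keys documented for self.regions, that BED rows are (chromosome, start, stop, name) tuples, and that coordinates are half-open. The function name intersect_rows is hypothetical.

```python
from typing import Any, Dict, List, Tuple

# (chromosome, start, stop, name), BED-style half-open coordinates (assumed layout)
BedRow = Tuple[str, int, int, str]


def intersect_rows(rows: List[BedRow], regions: List[Dict[str, Any]]) -> List[BedRow]:
    """Collect, region by region, the BED rows that overlap each region."""
    hits: List[BedRow] = []
    for region in regions:
        for chrom, start, stop, name in rows:
            same_chrom = chrom == region["chromosome"]
            # Half-open intervals overlap when each starts before the other ends.
            if same_chrom and start < region["stop"] and stop > region["start"]:
                hits.append((chrom, start, stop, name))
    return hits


rows = [("chr1", 0, 100, "s1"), ("chr1", 100, 250, "s2"), ("chr2", 0, 50, "s3")]
regions = [{"chromosome": "chr1", "start": 90, "stop": 120}]
print(intersect_rows(rows, regions))
# [('chr1', 0, 100, 's1'), ('chr1', 100, 250, 's2')]
```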

filter_bed_with_awk()

Filter a BED file using an awk command line to extract only lines containing an ID of interest.

Return type:

BedTool
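
The effect of such a filter can be sketched in pure Python. The actual awk command used by gratools is not shown in this reference, so the column holding the ID (the fourth, here) and the function name filter_bed_lines are assumptions made for the example.

```python
from typing import Iterable, List, Set


def filter_bed_lines(lines: Iterable[str], wanted_ids: Set[str], id_column: int = 3) -> List[str]:
    """Keep BED lines whose ID column (0-based index, assumed to be 3) is in wanted_ids.

    Roughly what an awk two-file lookup such as
        awk -F'\\t' 'NR==FNR {ids[$1]; next} $4 in ids' ids.txt regions.bed
    would do, where ids.txt and regions.bed are placeholder file names.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > id_column and fields[id_column] in wanted_ids:
            kept.append(line.rstrip("\n"))
    return kept


bed_lines = ["chr1\t0\t100\ts1", "chr1\t100\t250\ts2", "chr1\t250\t300\ts3"]
print(filter_bed_lines(bed_lines, {"s1", "s3"}))
# ['chr1\t0\t100\ts1', 'chr1\t250\t300\ts3']
```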

get_chr_pos(progress_dict, task_id)

Identify chromosomal regions corresponding to segments in self.segment_id_set, then compute intersections, and finally build walks and segments for these regions.

Parameters

progress_dict : Optional[Dict]

A dictionary to track progress for multiprocessing.

task_id : Optional[TaskID]

The task ID for rich.progress tracking.

Parameters:
  • progress_dict (Dict | None)

  • task_id (TaskID | None)

Return type:

None

Parameters (SubGraph constructor):
  • bam_path (Path)

  • bed_path (Path)

  • logger (Logger)

  • sample_name (str | None)

  • sample_name_query (str | None)

  • chromosome_query (str | None)

  • start_query (int | None)

  • stop_query (int | None)

  • offset_first (int)

  • offset_last (int)

  • add_start_bases_first_segment (int)

  • intersect_bed (BedTool | None)

  • segment_id_set (Set[str])

  • segment_id_first_query (str | None)

  • segment_id_first_strand (str | None)

  • segment_id_last_query (str | None)

  • segment_id_last_strand (str | None)

  • segment_id_first (str | None)

  • segment_id_last (str | None)

  • works_path (Path | None)

  • merge (int | None)

  • build_fasta_flag (bool | None)

  • gfa_walk_list (List[str])

  • gfa_link_list (List[str])

  • gfa_segment_list (List[str])

  • dict_segments_samples (defaultdict)

  • dict_segments_sequence (defaultdict)

  • sequences_list (List[SeqRecord])

  • progress_dict (Dict | None)

  • task_id (TaskID | None)

  • regions (List[Dict[str, Any]] | None)

  • intersected_results_by_regions (BedTool | None)