gratools extract_subgraph
The extract_subgraph command extracts a valid subgraph from a larger GFA file based on a genomic region. The region is defined using the coordinates from a query sample, but the command can extract all paths traversing that region for the query sample, a specified list of samples, or all samples in the graph. The output is a new, smaller GFA file containing only the relevant segments and links.
Options
Usage Examples
Extracting a subgraph for a specific region
This example extracts a subgraph corresponding to the region on CG14_Chr07 from 100,000 to 150,000, including the paths for all samples (–all-samples) that traverse this region.
$ gratools extract_subgraph --gfa Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 --start-query 100000 --stop-query 150000 \
--all-samples --threads 4
╭──────────────────────────────────────── Global Tracker ──────────────────────────────────────╮
│ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:04 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────────────── Samples Tracker ─────────────────────────────────────────────╮
│ 'Og103': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Og182': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Og20': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
01-13 14:25 | INFO | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-100000-150000.gfa.gz
Extracting a subgraph for a specific region and generate a FASTA file from the subgraph
In addition to the GFA file, this command will also generate a FASTA file containing the sequences of all segments present in the extracted subgraph, using the –build-fasta flag.
$ gratools extract_subgraph --gfa Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 --start-query 0 --stop-query 50000 --build-fasta --all-samples
╭──────────────────────────────────── Global Tracker ──────────────────────────────────────────╮
│ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:17 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────────── Samples Tracker ────────────────────────────────────────────────╮
│ 'Og103': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Og182': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:08 0:00:00 │
│ 'Og20': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:12 0:00:00 │
│ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:17 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
01-13 14:55 | INFO | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz
01-13 14:55 | INFO | Generated FASTA file: 'Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.fasta'
Warning
Note that the output fasta file will contains sequences from all samples of the subgraph (for a more versatile
method about fasta extraction, please use the get_fasta subcommand)
Extracting a subgraph for a list of samples provided in a file
This example extracts a subgraph for the specified region but only includes the paths for the samples listed in list_sample2extract.txt.
$ cat list_sample2extract.txt
Og182
Tog5681
$ gratools extract_subgraph --gfa Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 --start-query 0 --stop-query 50000 --samples-list list_sample2extract.txt --threads 4
╭─────────────────────────── Global Tracker ───────────────────────────────────────────────────╮
│ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 2/2 samples 0:00:04 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────── Samples Tracker ─────────────────────────────────────────────────────────╮
│ 'Og182': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
01-13 15:12 | INFO | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz
Warning
The output filename is based on the query region, not the list of samples. Be aware that running this command with different sample lists but the same query region will overwrite previous results unless you specify a different output directory or suffix.
Illustrated Example
Understanding the Process
Define a Region: The
--sample-query,--chrom-query,--start-query, and--stop-queryoptions define a genomic interval using a specific sample’s coordinate system.Identify Paths: The tool identifies all paths for the selected samples (query sample, all samples, or a list) that pass through this genomic interval. The blue path in the graph below represents the traversal for the query sample.
![// Pangenome Representation
digraph PangenomeGraph {
rankdir=LR; // Set graph direction to left-to-right
// Define nodes with labels representing the genome structure
AC [label="AC"];
C1 [label="C"];
G1 [label="G"];
GTTAA [label=<G<b>T</b>TAA>, style="filled,rounded"];
G2 [label="G", style="filled,rounded"];
GG [label="GG", style="filled,rounded"];
C2 [label="C", style="filled,rounded"];
GATCG [label=<GA<b>T</b>CG>, style="filled,rounded"];
CTCG [label="CTCG"];
AA [label="AA"];
TTTT [label="TTTT"];
// Define edges showing relationships or paths between nodes
AC -> G1 [color="blue"];
AC -> C1 [color="red"];
G1 -> GTTAA [color="blue"];
C1 -> GTTAA [color="red"];
GTTAA -> G2 [color="blue"];
GTTAA -> GATCG [color="red"];
G2 -> C2 [color="red"];
G2 -> GG [color="blue"];
GG -> C2 [color="blue"];
C2 -> GATCG [color="blue"];
GATCG -> CTCG [color="blue"];
GATCG -> AA [color="red"];
AA -> CTCG [color="red"];
CTCG -> TTTT [color="blue"];
AA -> TTTT [color="red"];
}](../_images/graphviz-8bfc1ff64ffd40b9b874ff249905abc0a794d1a1.png)
Extract Subgraph: All segments and links that are part of these identified paths are collected into a new GFA file. This ensures that the local topology around the region of interest is preserved for all selected samples.
![// Pangenome Representation
digraph PangenomeGraph {
rankdir=LR; // Set graph direction to left-to-right
// Define nodes with labels representing the genome structure
GTTAA [label="GTTAA", style="filled,rounded"];
G2 [label="G", style="filled,rounded"];
GG [label="GG", style="filled,rounded"];
C2 [label="C", style="filled,rounded"];
GATCG [label="GATCG", style="filled,rounded"];
// Define edges showing relationships or paths between nodes
GTTAA -> G2 [color="blue"];
GTTAA -> GATCG [color="red"];
G2 -> C2 [color="red"];
G2 -> GG [color="blue"];
GG -> C2 [color="blue"];
C2 -> GATCG [color="blue"];
}](../_images/graphviz-4d844f4197b18b8420187643c1f81805e5050375.png)
Important Note: extract_subgraph does not cut segments. If a query coordinate falls in the middle of a segment, the entire segment is included in the subgraph. This design choice preserves the original segment IDs and graph topology.
The –merge-dist Option Explained
The --merge-dist (or -d) option is a powerful parameter that controls how gratools identifies relevant walks, especially in graphs derived from fragmented assemblies.
It corresponds to the -d parameter in bedtools merge.
Before extracting nodes, gratools first identifies all walk intervals for each sample. The –merge-dist option merges walk intervals from the same sample and chromosome if they are closer than the specified distance. This is very useful for capturing a complete locus even if it is represented by multiple, non-contiguous W-lines in the GFA.
Default (`-d 0` or if the default behavior is 0): Only intervals that overlap or are exactly adjacent (book-ended) are merged. Walks separated by even a small distance will not be combined.
Custom Value (`-d N` where N > 0): You can set a specific distance in base pairs. A large value can be used to link distant fragments of the same chromosome, treating them as a single region of interest for extraction.
Concrete Example:
Let’s consider a GFA where many samples have fragmented or repetitive walks. We will run an extraction with a query region on genomeA, chromosome genomeA_chr1, from position 45 to 100.
Command 1: Merge Distance = 0 (`-d 0`)
With a merge distance of 0, gratools will only merge walks that touch.
$ gratools extract_subgraph -g all_test.gfa --sample-query genomeA --chrom-query genomeA_chr1 \
--start-query 45 --stop-query 100 --all-samples -d 0
The resulting GFA file contains:
`gfa
H VN:Z:1.1 RS:Z:genomeA
...
S 10 GTTAA
S 11 CCGGT
...
S 20 ACGTT
...
W genomeA 0 genomeA_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20
W genomeC 0 genomeC_chr1_1 45 100 >10>11>12>13>14>15>16>17>18>19>20
W genomeC 0 genomeC_chr1_2 0 55 >10>11>12>13>14>15>16>17>18>19>20
W genomeE 0 genomeE_chr1 125 180 >10>11>12>13>14>15>16>17>18>19>20
W genomeE 0 genomeE_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20
...
`
- Observation: For genomeE, two separate walk (W) lines are present in the output. This is because both of genomeE’s paths (45-100 and 125-180) contain nodes from the query region (>10>…>20), but the paths themselves are treated as separate events because they are not adjacent. The subgraph only contains nodes 10 through 20.
Command 2: Merge Distance = 100 (`-d 100`)
With a merge distance of 100, gratools will treat walks from the same sample/chromosome that are within 100 bp of each other as a single large region.
$ gratools extract_subgraph -g all_test.gfa --sample-query genomeA --chrom-query genomeA_chr1 \
--start-query 45 --stop-query 100 --all-samples -d 100
The resulting GFA file now contains more nodes:
`gfa
H VN:Z:1.1 RS:Z:genomeA
...
S 10 GTTAA
...
S 20 ACGTT
S 21 TGCAT
...
S 25 AACAA
S 6 CCCCC
S 7 GGGGG
S 8 TTTTT
S 9 AACCG
...
W genomeA 0 genomeA_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20
W genomeE 0 genomeE_chr1 45 180 >10>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25>10>11>12>13>14>15>16>17>18>19>20
...
`
- Observation: For genomeE, the two paths (45-100 and 125-180) are now merged into a single large path (45-180) because the 25 bp gap between them is less than the merge-dist of 100.
- Result: The subgraph is much larger. It now includes all nodes from the merged genomeE path, including nodes 21 through 25, which were not present in the first case. Similarly, the paths for genomeF were merged, bringing nodes 6, 7, 8, and 9 into the subgraph.
Conclusion: Using a –merge-dist > 0 is essential for reconstructing complete loci when assemblies are fragmented, ensuring that all parts of a region of interest are included in the extracted subgraph.