gratools get_subgraph
Extract a valid, smaller GFA subgraph based on a genomic region of interest.
The get_subgraph command allows you to isolate a specific genomic window from a massive pangenome graph. By defining a region on a query sample, GraTools identifies all paths (walks) traversing that area and extracts all relevant segments and links into a new, fully valid GFA file.
Options
🛠️ View Command Line Options
$ gratools get_subgraph
Welcome to GraTools version: '0.1.0.dev134'
@author: GraTools team's
____ __________ ____
6MMMMMb/ MMMMMMMMMM `MM
8P YM / MM \ MM
6M Y ___ __ ___ MM _____ _____ MM ____
MM `MM 6MM 6MMMMb MM 6MMMMMb 6MMMMMb MM 6MMMMb\
MM MM69 " 8M' `Mb MM 6M' `Mb 6M' `Mb MM MM' `
MM ___ MM' ,oMM MM MM MM MM MM MM YM.
MM `M' MM ,6MM9'MM MM MM MM MM MM MM YMMMMb
YM M MM MM' MM MM MM MM MM MM MM `Mb
8b d9 MM MM. ,MM MM YM. ,M9 YM. ,M9 MM L ,MM
YMMMMM9 _MM_ `YMMM9'Yb_MM_ YMMMMM9 YMMMMM9 _MM_MYMMMM9
\ / /
/''A''\ /''''''\ / /''''A'''''\
...GC| |..ATG...C...CG...T....TAG..'..GC.| |...
\..C../ \.............../ \...TATA.../
Please cite our gitlab: https://forge.ird.fr/diade/gratools.git\
Usage: gratools get_subgraph [OPTIONS]
Aliases: subgraph
This command extracts a specific region from the GFA, defined by a query
sample, chromosome, and start/end coordinates. The extracted subgraph,
containing all paths traversing this region (for the query sample and any
other specified samples), is saved as a new GFA file. Optionally, a
corresponding FASTA file of the sequences in the subgraph can be generated.
This command relies on a pre-existing GraTools import of the input GFA.
For more details, see the full documentation:
https://gratools.readthedocs.io/en/latest/commands/get_subgraph.html
Extraction Query Options:
-g, --gfa PATH
Path to the input GFA file (e.g., myGraph.gfa or myGraph.gfa.gz).
[required]
-o, --outdir DIRECTORY
Output directory for GraTools results. If not specified, results are
typically placed in a subdirectory within the GFA file's parent directory
(e.g., 'GraTools-output_<gfa_name>').
-su, --suffix TEXT
Custom suffix to append to output filenames. If not provided, a default
suffix will be generated based on the command line parameters.
-sq, --sample-query TEXT
Name of the primary query sample to define the region. [required]
-chr, --chrom-query TEXT
Name of the chromosome for the query region. [required]
-s, --start-query INTEGER RANGE
Start position of the query region on the chromosome (0-based). [default:
0; x>=0]
-e, --stop-query INTEGER RANGE
Stop position of the query region on the chromosome (exclusive). Defaults
to chromosome end if not provided. [x>=1]
-d, --merge-dist INTEGER RANGE
Merge distance for 'bedtools merge -d'. If -1, uses 10% of query region
length. 0 for abutting. See bedtools merge docs. [default: -1; x>=-1]
-sl, --samples-list FILE
Path to a file listing additional sample names (one per line) to include in
the extraction. Mutually exclusive with --all-samples.
-as, --all-samples
Include all samples from the GFA in the extraction (relative to the query
region). Mutually exclusive with --samples-list.
Subgraph Specific Output Options:
--build-fasta / --no-build-fasta
Generate a FASTA file from the sequences within the extracted subgraph.
[default: no-build-fasta]
Logging Options:
-vv, --verbosity [DEBUG|INFO|ERROR]
Set the logging verbosity level. [default: INFO]
-l, --log-path DIRECTORY
Directory where the log files will be saved. If not specified, logs will be
placed in the main output directory (or in a default GraTools log
location).
Performance Options:
-t, --threads INTEGER
Number of threads to be used for parallelizable operations. [default: 1]
Other options:
-h, --help
Show this message and exit.
Constraints:
{--samples-list, --all-samples}
The --samples-list and --all-samples options are mutually exclusive.If
neither is provided, sequences might be extracted only for the query
sample.
Usage Examples
This example extracts a subgraph corresponding to the region on CG14_Chr07 from 100,000 to 150,000, including the paths for all samples (--all-samples) that traverse this region.
$ gratools get_subgraph --gfa Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 \
--start-query 100000 --stop-query 150000 \
--all-samples --threads 4
╭─────────────────────────────────────── Global Tracker ───────────────────────────────────────╮
│ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:04 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────────────────── Samples Tracker ──────────────────────────────────────────╮
│ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
01-13 14:25 | INFO | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-100000-150000.gfa.gz
With the --build-fasta flag, the command generates, in addition to the GFA file, a FASTA file containing the sequences of all segments present in the extracted subgraph.
$ gratools get_subgraph --gfa Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 \
--start-query 0 --stop-query 50000 \
--build-fasta --all-samples
╭─────────────────────────────────────── Global Tracker ───────────────────────────────────────╮
│ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:17 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────────────────── Samples Tracker ──────────────────────────────────────────╮
│ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:08 0:00:00 │
│ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:12 0:00:00 │
│ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:17 0:00:00 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
01-13 14:55 | INFO | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz
01-13 14:55 | INFO | Generated FASTA file: 'Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.fasta'
The output FASTA contains sequences from all samples present in the subgraph. For more specific FASTA extraction, consider using the gratools get_fasta subcommand.
This example extracts a subgraph for the specified region but only includes the paths for the samples listed in list_sample2extract.txt.
$ gratools get_subgraph --gfa Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 \
--start-query 0 --stop-query 50000 \
--samples-list list_sample2extract.txt
╭─────────────────────────────────────── Global Tracker ───────────────────────────────────────╮
│ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 2/2 samples 0:00:04 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────────────────── Samples Tracker ──────────────────────────────────────────╮
│ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
01-13 15:12 | INFO | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz
The output filename is based on the query region, not the list of samples. Be aware that running this command with different sample lists but the same query region will overwrite previous results unless you specify a different output directory or suffix.
Illustrated Process
Define a Region: The --sample-query, --chrom-query, --start-query, and --stop-query options define a genomic interval using a specific sample's coordinate system.
Identify Paths: The tool identifies all paths for the selected samples (query sample, all samples, or a list) that pass through this genomic interval. The blue path in the graph below represents the traversal for the query sample.
Fig 1: Identification of the query path (Blue).
Collect Elements: Every segment and link belonging to these walks is collected.
Fig 2: Extracted subgraph.
Write GFA: A new valid GFA is produced.
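The "Collect Elements" step can be pictured with a short Python sketch. This is only an illustration under stated assumptions, not GraTools' implementation: the collect_elements helper and the two toy walk strings are hypothetical, and each oriented pair of consecutive steps is recorded as one link without normalising reverse-complement duplicates.

```python
import re

def collect_elements(walks):
    """Collect the segment IDs and oriented links used by GFA walk strings.

    Hypothetical helper for illustration; a walk looks like '>1>2<3'.
    """
    segments = set()
    links = set()
    for walk in walks:
        # Each step is an orientation ('>' or '<') followed by a segment ID.
        steps = re.findall(r"([<>])([^<>]+)", walk)
        segments.update(seg for _, seg in steps)
        # Every pair of consecutive steps implies one link in the subgraph.
        for (o1, s1), (o2, s2) in zip(steps, steps[1:]):
            links.add((s1, o1, s2, o2))
    return segments, links

# Two toy walks (assumed data): the second one skips segment 2.
segments, links = collect_elements([">1>2>3", ">1>3"])
print(sorted(segments, key=int))
print(sorted(links))
```

Using sets means a segment or link shared by several walks is written only once, which is what keeps the extracted graph compact when many samples traverse the same region.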
The --merge-dist Parameter
This option controls how GraTools handles fragmented assemblies or repetitive regions by merging nearby walk intervals.
Initial GFA
This is the reference graph used in all examples below:
S 1 ACGTA
S 2 CGTAC
S 3 GTACG
S 4 TACGT
S 5 AAAAA
S 6 CCCCC
S 7 GGGGG
S 8 TTTTT
S 9 AACCG
S 10 GTTAA
S 11 CCGGT
S 12 TAACC
S 13 GGTTA
S 14 ACCGG
S 15 TTAAC
S 16 CGGTT
S 17 AACGT
S 18 ACCGT
S 19 ACGGT
S 20 ACGTT
S 21 TGCAT
S 22 GCATG
S 23 CATGC
S 24 ATGCA
S 25 AACAA
S 26 AAGAA
S 27 AATAA
S 28 CCACC
S 29 CCTCC
S 30 CCGCC
S 31 CTCTC
S 32 CTCTC
S 33 CTCTC
S 34 CTCTC
S 35 CTCTC
S 36 CTCTCCTCTC...
Walks across 4 genomes:
W genomeA 0 genomeA_chr1 0 150 >1>2>3>4>5>6>7>8>9>10>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
W genomeB 0 genomeB_chr1 0 175 >1>2>3>4>5>6>7>8>9>11>12>13>14>15>30>31>32>33>34>35>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
W genomeC 0 genomeC_chr1 0 280 >1>2>3>4>5>6>7>8>9>10>11>12>13>14>15>30>36>31>32>33>34>35>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
W genomeD 0 genomeD_chr1 0 145 >1>2>3>4>5>6>7>8>9>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
Full input graph structure. Query region (45–100 bp) marked in red.
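To see where the query coordinates fall on a given walk, segment positions can be derived by accumulating segment lengths along the walk. The sketch below is illustrative only: segment_offsets is a hypothetical helper, not a GraTools function, and it relies on segments 1 to 35 each being 5 bp long, as listed above (segment 36 is not needed for genomeB).

```python
def segment_offsets(walk, lengths, walk_start=0):
    """Return (segment_id, start, end) for each step of a forward-only walk."""
    offsets = []
    pos = walk_start
    for seg_id in walk.strip(">").split(">"):
        end = pos + lengths[seg_id]
        offsets.append((seg_id, pos, end))  # half-open interval [pos, end)
        pos = end
    return offsets

# Segments 1..35 are 5 bp each in the toy graph.
lengths = {str(i): 5 for i in range(1, 36)}
genome_b = (">1>2>3>4>5>6>7>8>9>11>12>13>14>15>30>31>32>33>34>35"
            ">16>17>18>19>20>21>22>23>24>25>26>27>28>29>30")

offsets = segment_offsets(genome_b, lengths)
# Keep the first occurrence only: segment 30 appears twice in this walk.
first = {}
for seg_id, start, end in offsets:
    first.setdefault(seg_id, (start, end))

print(first["11"], first["15"])  # first block shared with the query region
print(first["16"], first["20"])  # second block, after the 30..35 insertion
print(offsets[-1][2])            # total walk length
```

On genomeB, segments 11 to 15 span 45–70 and segments 16 to 20 span 100–125, which are the two intervals the merge distance has to bridge in the examples that follow.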
Overview of merge-dist settings
Parameter: -d 0 (no merge)
Only intervals that overlap or are exactly adjacent (book-ended) are merged. Walks separated by even a small distance will not be combined.
Result: Fragmented W-lines remain separate. If a sample has walks covering your region but split by gaps, only pieces directly overlapping the query are extracted.
Use case: When you need strict coordinate-based extraction; your assembly is well-contiguous.
Parameter: -d -1 (default, automatic)
GraTools automatically calculates the merge distance as 100% of the query sequence length. This is the recommended setting for balanced, predictable results.
For a 55bp query (45–100): merge distance = 55bp
Result: Gaps proportional to the size of your query are bridged, ensuring more complete subgraphs that capture biologically related fragments without aggressive over-pulling.
Use case: Typical pangenome analysis with expected fragmentation; predictable scaling across different query sizes.
Parameter: -d N (manual value)
You can set a specific distance in base pairs using an integer value.
Result: Full control for highly fragmented assemblies. A large value can be used to link distant fragments of the same chromosome.
Use case: When you know the expected assembly gap size; highly fragmented genomes; custom filtering requirements.
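The way the three settings map to an effective distance can be sketched as follows. This is a hedged illustration, not GraTools code: resolve_merge_dist and its fraction parameter are hypothetical. The fraction is left configurable because the command-line help quoted earlier mentions 10% of the query length while the worked example in this section uses 100%; truncating to an integer is also an assumption.

```python
def resolve_merge_dist(merge_dist, start, stop, fraction=1.0):
    """Return the effective merge distance in bp (hypothetical helper).

    merge_dist == -1 means "automatic": a fraction of the query length.
    The query is the half-open interval [start, stop).
    """
    if merge_dist < -1:
        raise ValueError("merge distance must be >= -1")
    if stop <= start:
        raise ValueError("stop must be greater than start")
    if merge_dist == -1:
        # Assumption: the automatic value is truncated to whole base pairs.
        return int((stop - start) * fraction)
    return merge_dist

print(resolve_merge_dist(-1, 45, 100))     # automatic, 100% of a 55 bp query
print(resolve_merge_dist(0, 45, 100))      # strict: overlapping or abutting only
print(resolve_merge_dist(10000, 45, 100))  # manual value is used as given
```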
Concrete Example: Query Region 45–100 (55bp span)
Parameter: -d 0 (no merge)
Command:
$ gratools get_subgraph -g all_test.gfa \
-sq genomeA -chr genomeA_chr1 \
--start-query 45 --stop-query 100 \
-d 0
Logic: Only walks overlapping the exact query coordinates are extracted. Nearby fragments are ignored, creating fragmented results.
Graph visualization:
With -d 0: Only core region extracted; gaps block fragment merging.
Extracted segments: 10–20 (core only)
Resulting walks:
W genomeA 0 genomeA_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20
W genomeB 0 genomeB_chr1 45 70 >11>12>13>14>15
W genomeB 0 genomeB_chr1 100 125 >16>17>18>19>20
W genomeC 0 genomeC_chr1 45 75 >10>11>12>13>14>15
W genomeC 0 genomeC_chr1 205 230 >16>17>18>19>20
W genomeD 0 genomeD_chr1 45 95 >11>12>13>14>15>16>17>18>19>20
⚠️ Problem:
genomeB is split into 2 separate walks (45–70 and 100–125)
genomeC is split into 2 separate walks (45–75 and 205–230)
The subgraph is fragmented, losing biological continuity
Parameter: -d 55 (default: 100% of 55bp query)
Command:
$ gratools get_subgraph -g all_test.gfa \
-sq genomeA -chr genomeA_chr1 \
--start-query 45 --stop-query 100
# No -d given: the default (-d -1) resolves to 55 (100% of query length)
Logic: Nearby intervals separated by โค55bp are merged. Intermediate segments are pulled in, creating more complete paths while staying proportional to the query size.
Graph visualization:
With -d 55: Bridges the 30bp gap in genomeB (70–100); merges nearby fragments; includes repetitive region.
Extracted segments: 10–20, 30–35 (core + bridged regions)
Resulting walks:
W genomeA 0 genomeA_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20
W genomeB 0 genomeB_chr1 45 125 >11>12>13>14>15>30>31>32>33>34>35>16>17>18>19>20
W genomeC 0 genomeC_chr1 45 75 >10>11>12>13>14>15
W genomeC 0 genomeC_chr1 205 230 >16>17>18>19>20
W genomeD 0 genomeD_chr1 45 95 >11>12>13>14>15>16>17>18>19>20
✅ Benefit:
genomeB is now merged into a single walk (45–125) with a biologically continuous path
Repetitive region (nodes 30–35) is included
genomeC remains fragmented (its 130bp gap exceeds 55bp), which is correct
Balanced result: More complete without over-pulling
Parameter: -d 10000 (manual large value)
Command:
$ gratools get_subgraph -g all_test.gfa \
-sq genomeA -chr genomeA_chr1 \
--start-query 45 --stop-query 100 \
-d 10000
Logic: All walks within 10000bp are merged. This bridges even very distant fragments, useful for highly fragmented assemblies.
Graph visualization:
With -d 10000: Bridges all gaps; pulls in all reachable fragments including long-range repeats.
Extracted segments: 10–20, 30–36 (core + all reachable)
Resulting walks:
W genomeA 0 genomeA_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20
W genomeB 0 genomeB_chr1 45 125 >11>12>13>14>15>30>31>32>33>34>35>16>17>18>19>20
W genomeC 0 genomeC_chr1 45 230 >10>11>12>13>14>15>30>36>31>32>33>34>35>16>17>18>19>20
W genomeD 0 genomeD_chr1 45 95 >11>12>13>14>15>16>17>18>19>20
⚠️ Trade-off:
genomeC is now merged (45–230, previously fragmented)
Large tandem repeat (node 36) is included, significantly expanding subgraph
Biologically relevant for some loci, but may include unrelated structural variants
Better for highly fragmented genomes; larger outputs
Observation Summary
Case 1 (-d 0):
Minimal but fragmented
genomeB & genomeC split across multiple walks
Best for: Strict coordinate extraction only
Case 2 (-d 55, default 100%):
RECOMMENDED for most analyses
Balances completeness with specificity
genomeB successfully merged (30bp gap between 70 and 100 < 55bp threshold)
Automatic scaling: works across different query sizes
genomeC correctly remains fragmented (gap > 55bp)
Case 3 (-d 10000):
Maximum connectivity
All genomes merged (even distant fragments)
Larger subgraphs with long-range repeats
Best for: Highly fragmented assemblies
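The three cases can be reproduced with a small interval-merging sketch. It follows the bedtools merge -d convention, where two intervals are merged when the gap between them is at most d. The merge_intervals helper is hypothetical and is not GraTools code; the interval coordinates are taken from the -d 0 walks listed for genomeB and genomeC.

```python
def merge_intervals(intervals, max_gap):
    """Merge half-open intervals whose gap is <= max_gap (bedtools -d style)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

# Fragments found with -d 0 in the example above.
fragments = {
    "genomeB": [(45, 70), (100, 125)],   # gap of 30 bp
    "genomeC": [(45, 75), (205, 230)],   # gap of 130 bp
}

for d in (0, 55, 10000):
    for sample, intervals in fragments.items():
        print(f"-d {d:<5} {sample}: {merge_intervals(intervals, d)}")
```

With d = 55 only genomeB collapses into a single interval, while genomeC needs a distance of at least 130 before its two fragments are joined.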
When to Use Each Setting
Use -d 0 when:
Strict coordinate extraction only
Assembly is well-contiguous
Need to avoid ambiguity about "nearby"
Minimal graph size is critical
Testing or validation purposes
Use the default (-d -1) when:
Moderate fragmentation expected (typical pangenomes)
Want predictable, proportional behavior
Query sizes vary widely
Need biologically coherent subgraphs
No special requirements โ use this
Use -d N (manual value) when:
Know expected assembly gap size
Work with highly fragmented genomes
Enforce stricter filtering (small N)
Enforce aggressive merging (large N)
Have known characteristics of your data
Impact on Subgraph Size
Query: 45–100 (55bp span)
Parameter         Segments   Links   Walks
────────────────────────────────────────────────
-d 0              11         10      5 (split)
-d 55 (default)   16         14      5 (merged)
-d 10000          17         15      5 (merged)
Note: Larger merge distances produce larger subgraphs and potentially longer processing times. The default (100% of the query length) strikes a balance.
Quick Links
Command Import: gratools import
FASTA Specialist: gratools get_fasta
Graph Stats: gratools stats