gratools extract_subgraph
Extract a valid, smaller GFA subgraph based on a genomic region of interest.
The extract_subgraph command allows you to isolate a specific genomic window from a massive pangenome graph. By defining a region on a query sample, GraTools identifies all paths (walks) traversing that area and extracts all relevant segments and links into a new, fully valid GFA file.
Options
View Command Line Options
$ gratools extract_subgraph
Welcome to GraTools version: '1.1.0.dev7'
@author: GraTools team's
____ __________ ____
6MMMMMb/ MMMMMMMMMM `MM
8P YM / MM \ MM
6M Y ___ __ ___ MM _____ _____ MM ____
MM `MM 6MM 6MMMMb MM 6MMMMMb 6MMMMMb MM 6MMMMb\
MM MM69 " 8M' `Mb MM 6M' `Mb 6M' `Mb MM MM' `
MM ___ MM' ,oMM MM MM MM MM MM MM YM.
MM `M' MM ,6MM9'MM MM MM MM MM MM MM YMMMMb
YM M MM MM' MM MM MM MM MM MM MM `Mb
8b d9 MM MM. ,MM MM YM. ,M9 YM. ,M9 MM L ,MM
YMMMMM9 _MM_ `YMMM9'Yb_MM_ YMMMMM9 YMMMMM9 _MM_MYMMMM9
\ / /
/''A''\ /''''''\ / /''''A'''''\
...GC| |..ATG...C...CG...T....TAG..'..GC.| |...
\..C../ \.............../ \...TATA.../
Please cite our gitlab: https://forge.ird.fr/diade/gratools.git\
Usage: gratools extract_subgraph [OPTIONS]
This command extracts a specific region from the GFA, defined by a query
sample, chromosome, and start/end coordinates. The extracted subgraph,
containing all paths traversing this region (for the query sample and any
other specified samples), is saved as a new GFA file. Optionally, a
corresponding FASTA file of the sequences in the subgraph can be generated.
This command relies on a pre-existing GraTools index of the input GFA.
For more details, see the full documentation:
https://gratools.readthedocs.io/en/latest/commands/extract_subgraph.html
Extraction Query Options:
-g, --gfa PATH
Path to the input GFA file (e.g., myGraph.gfa or myGraph.gfa.gz).
[required]
-o, --outdir DIRECTORY
Output directory for GraTools results. If not specified, results are
typically placed in a subdirectory within the GFA file's parent directory
(e.g., 'GraTools-output_<gfa_name>').
-su, --suffix TEXT
Custom suffix to append to output filenames. If not provided, a default
suffix will be generated based on the command line parameters.
-sq, --sample-query TEXT
Name of the primary query sample to define the region. [required]
-chr, --chrom-query TEXT
Name of the chromosome for the query region. [required]
-s, --start-query INTEGER RANGE
Start position of the query region on the chromosome (0-based). [default:
0; x>=0]
-e, --stop-query INTEGER RANGE
Stop position of the query region on the chromosome (exclusive). Defaults
to chromosome end if not provided. [x>=1]
-d, --merge-dist INTEGER RANGE
Merge distance for 'bedtools merge -d'. If -1, uses 10% of query region
length. 0 for abutting. See bedtools merge docs. [default: -1; x>=-1]
-sl, --samples-list FILE
Path to a file listing additional sample names (one per line) to include in
the extraction. Mutually exclusive with --all-samples.
-as, --all-samples
Include all samples from the GFA in the extraction (relative to the query
region). Mutually exclusive with --samples-list.
Subgraph Specific Output Options:
--build-fasta / --no-build-fasta
Generate a FASTA file from the sequences within the extracted subgraph.
[default: no-build-fasta]
Logging Options:
-vv, --verbosity [DEBUG|INFO|ERROR]
Set the logging verbosity level. [default: INFO]
-l, --log-path DIRECTORY
Directory where the log files will be saved. If not specified, logs will be
placed in the main output directory (or in a default GraTools log
location).
Performance Options:
-t, --threads INTEGER
Number of threads to be used for parallelizable operations. [default: 1]
Other options:
-h, --help
Show this message and exit.
Constraints:
{--samples-list, --all-samples}
The --samples-list and --all-samples options are mutually exclusive. If
neither is provided, sequences might be extracted only for the query sample.
Usage Examples
This example extracts the subgraph for the region on CG14_Chr07 from 100,000 to 150,000, including the paths of all samples (--all-samples) that traverse this region.
$ gratools extract_subgraph --gfa Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 \
--start-query 100000 --stop-query 150000 \
--all-samples --threads 4
╭─────────────────────────────────────── Global Tracker ───────────────────────────────────────╮
│ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:04 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────────────────── Samples Tracker ──────────────────────────────────────────╮
│ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
01-13 14:25 | INFO | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-100000-150000.gfa.gz
With the --build-fasta flag, the command also generates a FASTA file containing the sequences of all segments present in the extracted subgraph, in addition to the GFA file.
$ gratools extract_subgraph --gfa Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 \
--start-query 0 --stop-query 50000 \
--build-fasta --all-samples
╭─────────────────────────────────────── Global Tracker ───────────────────────────────────────╮
│ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:17 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────────────────── Samples Tracker ──────────────────────────────────────────╮
│ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:08 0:00:00 │
│ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:12 0:00:00 │
│ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:17 0:00:00 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
01-13 14:55 | INFO | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz
01-13 14:55 | INFO | Generated FASTA file: 'Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.fasta'
The output FASTA contains sequences from all samples present in the subgraph. For more specific FASTA extraction, consider using the gratools get_fasta subcommand.
This example extracts a subgraph for the specified region but only includes the paths for the samples listed in list_sample2extract.txt.
$ gratools extract_subgraph --gfa Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 \
--samples-list list_sample2extract.txt
╭─────────────────────────────────────── Global Tracker ───────────────────────────────────────╮
│ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 2/2 samples 0:00:04 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────────────────── Samples Tracker ──────────────────────────────────────────╮
│ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
│ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
01-13 15:12 | INFO | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz
The output filename is based on the query region, not the list of samples. Be aware that running this command with different sample lists but the same query region will overwrite previous results unless you specify a different output directory or suffix.
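One way to avoid such overwrites is to pass a distinct --suffix for each sample list. The following Python sketch only builds the argument lists and does not run GraTools. The flags come from the help text above, while the list file names and the idea of reusing the file stem as the suffix are hypothetical choices made for illustration.

```python
from pathlib import Path


def build_command(gfa, sample, chrom, start, stop, samples_list):
    """Return an argv list for one extraction, tagged by its sample list."""
    # Hypothetical naming scheme: reuse the list file's stem as the suffix.
    suffix = Path(samples_list).stem
    return [
        "gratools", "extract_subgraph",
        "--gfa", gfa,
        "--sample-query", sample,
        "--chrom-query", chrom,
        "--start-query", str(start),
        "--stop-query", str(stop),
        "--samples-list", samples_list,
        "--suffix", suffix,
    ]


# Hypothetical sample list files, one sample name per line in each.
list_files = ("wild_samples.txt", "cultivated_samples.txt")
commands = [
    build_command("Og_cactus.gfa.gz", "CG14", "CG14_Chr07", 0, 50000, name)
    for name in list_files
]
for argv in commands:
    print(" ".join(argv))
```

Each argument list could then be handed to subprocess.run(argv, check=True). How the suffix is combined with the default filename is decided by GraTools, so it is worth checking the log line of the first run.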
Illustrated Process
Define a Region: The --sample-query, --chrom-query, --start-query, and --stop-query options define a genomic interval in a specific sample's coordinate system.
Identify Paths: The tool identifies all paths for the selected samples (query sample, all samples, or a list) that pass through this genomic interval. The blue path in the graph below represents the traversal for the query sample.
Fig 1: Identification of the query path (Blue).
Collect Elements: Every segment and link belonging to these walks is collected.
Fig 2: Extracted subgraph.
Write GFA: A new valid GFA is produced.
GraTools does not cut segments. If a coordinate falls inside a segment, the entire segment is included. This design choice preserves the original segment IDs and graph topology.
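The whole-segment rule can be illustrated with a small Python sketch. It is not GraTools' implementation: the function name, the toy segment lengths, and the assumption that the walk starts at position 0 are all hypothetical. It only shows that a segment is kept entirely as soon as one of its bases falls inside the 0-based, stop-exclusive query interval.

```python
def segments_overlapping(walk, lengths, start, stop):
    """Return the segment IDs of `walk` that overlap [start, stop).

    Segments are never trimmed: one overlapping base keeps the whole segment.
    """
    kept = []
    offset = 0  # 0-based position of the current segment's first base
    for seg_id in walk:
        seg_end = offset + lengths[seg_id]
        if offset < stop and seg_end > start:
            kept.append(seg_id)
        offset = seg_end
    return kept


# Toy walk: three segments covering [0, 10), [10, 30) and [30, 35).
lengths = {"s1": 10, "s2": 20, "s3": 5}
walk = ["s1", "s2", "s3"]

kept = segments_overlapping(walk, lengths, start=12, stop=18)
span = sum(lengths[s] for s in kept)
print(kept, span)
```

Here the query covers 6 bp but the extracted span is the full 20 bp of s2, which is why the coordinates reported in the output walks can extend beyond the requested window.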
The --merge-dist Parameter
This option controls how GraTools handles fragmented assemblies or repetitive regions by merging nearby walk intervals.
Strict merging (--merge-dist 0): Only intervals that overlap or are exactly adjacent (book-ended) are merged. Walks separated by even a small distance will not be combined.
Result: Fragmented W-lines remain separate. If a sample has two pieces of the same locus separated by a small gap, only the piece touching the query region is extracted.
Automatic (--merge-dist -1, the default): GraTools calculates the merge distance as 10% of the query region length. This is the recommended setting for balanced results.
Result: Gaps proportional to the size of your query are bridged, giving a more continuous and biologically relevant subgraph without pulling in too much unrelated data.
Custom (--merge-dist N): You can set a specific distance in base pairs using a non-negative integer.
Result: Full control for highly fragmented assemblies. A large value can be used to link distant fragments of the same chromosome, treating them as a single region of interest for extraction.
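The three settings can be sketched in Python as follows. This is an illustration of the bedtools-style merge rule (intervals are merged when the gap between them is at most the distance), not the code GraTools runs. The function names and the toy intervals are hypothetical, and the integer rounding of the 10% default is an assumption.

```python
def effective_merge_dist(merge_dist, start, stop):
    """Resolve -1 to 10% of the query length; pass other values through."""
    if merge_dist == -1:
        # Assumption: integer division; GraTools' actual rounding may differ.
        return (stop - start) // 10
    return merge_dist


def merge_intervals(intervals, dist):
    """Merge (start, end) intervals separated by at most `dist` bases."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= dist:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


pieces = [(0, 40), (40, 60), (75, 120), (500, 540)]  # toy walk intervals
strict = merge_intervals(pieces, 0)    # only book-ended pieces are joined
bridged = merge_intervals(pieces, 20)  # the 15 bp gap is bridged as well
auto = effective_merge_dist(-1, 100000, 150000)
print(strict, bridged, auto)
```

With a distance of 0 only the two abutting pieces are joined, with 20 the 15 bp gap is also closed, and for a 50,000 bp query the automatic setting resolves to 5,000 bp under the rounding assumed here.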
Context: We have a GFA where some samples are fragmented.
Query: genomeA on chr1 from position 45 to 100.
The Challenge: genomeE has a path that covers the region but is split by a 25bp gap.
Case 1 (-d 0) logic: Only walks overlapping the query are kept. The 25 bp gap is treated as a boundary, resulting in two separate path entries for the same locus.
$ gratools extract_subgraph -g all_test.gfa \
-sq genomeA -chr genomeA_chr1 \
--start-query 45 --stop-query 100 -d 0
Resulting GFA Content:
H VN:Z:1.1 RS:Z:genomeA
...
S 10 GTTAA
S 11 CCGGT
...
S 20 ACGTT
...
W genomeA 0 genomeA_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20
W genomeC 0 genomeC_chr1_1 45 100 >10>11>12>13>14>15>16>17>18>19>20
W genomeC 0 genomeC_chr1_2 0 55 >10>11>12>13>14>15>16>17>18>19>20
W genomeE 0 genomeE_chr1 125 180 >10>11>12>13>14>15>16>17>18>19>20
W genomeE 0 genomeE_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20
Case 2 (-d 100) logic: Nearby intervals (<100 bp apart) are merged. The gap is bridged, allowing GraTools to "pull" the intermediate segments into the subgraph.
$ gratools extract_subgraph -g all_test.gfa \
-sq genomeA -chr genomeA_chr1 \
--start-query 45 --stop-query 100 -d 100
Resulting GFA Content:
H VN:Z:1.1 RS:Z:genomeA
...
S 10 GTTAA
...
S 20 ACGTT
S 21 TGCAT
...
S 25 AACAA
S 6 CCCCC
S 7 GGGGG
S 8 TTTTT
S 9 AACCG
...
W genomeA 0 genomeA_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20
W genomeE 0 genomeE_chr1 45 180 >10>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25>10>11>12>13>14>15>16>17>18>19>20
In Case 1, genomeE is split. The subgraph only contains nodes 10 through 20.
In Case 2, GraTools bridges the 25 bp gap because it is smaller than the --merge-dist value (100).
The Difference: The merged subgraph is larger and more complete. It captures the flanking segments (21 to 25) and repetitive nodes (6 to 9) that were skipped in the unmerged run.
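To check such a difference on real outputs, the segment sets of two walks can be compared. The Python sketch below does this for the genomeE W-lines of the two runs, copied from the listings above. The helper name is hypothetical, and the parsing assumes that the walk is the last whitespace-separated field of the W-line, that is, that no optional tags follow it.

```python
import re


def walk_segments(w_line):
    """Return the set of segment IDs visited by one GFA W-line."""
    walk = w_line.split()[-1]  # the last field holds the oriented walk
    return set(re.findall(r"[<>]([^<>]+)", walk))


core = ">10>11>12>13>14>15>16>17>18>19>20"
# genomeE walk overlapping the query in the -d 0 run.
case1 = "W genomeE 0 genomeE_chr1 45 100 " + core
# Merged genomeE walk from the -d 100 run.
case2 = "W genomeE 0 genomeE_chr1 45 180 " + core + ">21>22>23>24>25" + core

gained = sorted(walk_segments(case2) - walk_segments(case1), key=int)
print(gained)
```

Only the genomeE walks shown above are compared, so the result lists segments 21 to 25. Segments 6 to 9 appear in the S-lines of the second listing but not in the walks printed there, so they are not part of this difference.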
Quick Links
Command Index: gratools index
FASTA Specialist: gratools get_fasta
Graph Stats: gratools stats