gratools get_subgraph

Extract a valid, smaller GFA subgraph based on a genomic region of interest.

✂️ Subgraph Extraction

The get_subgraph command allows you to isolate a specific genomic window from a massive pangenome graph. By defining a region on a query sample, GraTools identifies all paths (walks) traversing that area and extracts all relevant segments and links into a new, fully valid GFA file.

Options

🛠️ View Command Line Options

$ gratools get_subgraph
Welcome to GraTools version: '0.1.0.dev134'
@author: GraTools team's
        ____                 __________               ____          
      6MMMMMb/               MMMMMMMMMM               `MM          
     8P    YM               /   MM     \               MM          
    6M      Y ___  __    ___    MM   _____     _____   MM   ____   
    MM        `MM 6MM  6MMMMb   MM  6MMMMMb   6MMMMMb  MM  6MMMMb\ 
    MM         MM69 " 8M'  `Mb  MM 6M'   `Mb 6M'   `Mb MM MM'    ` 
    MM     ___ MM'        ,oMM  MM MM     MM MM     MM MM YM.      
    MM     `M' MM     ,6MM9'MM  MM MM     MM MM     MM MM  YMMMMb  
    YM      M  MM     MM'   MM  MM MM     MM MM     MM MM      `Mb 
     8b    d9  MM     MM.  ,MM  MM YM.   ,M9 YM.   ,M9 MM L    ,MM 
      YMMMMM9  _MM_   `YMMM9'Yb_MM_ YMMMMM9   YMMMMM9 _MM_MYMMMM9 
        \                                    /                /
        /''A''\          /''''''\           /     /''''A'''''\
  ...GC|       |..ATG...C...CG...T....TAG..'..GC.|            |...
        \..C../      \.............../            \...TATA.../
 
Please cite our gitlab: https://forge.ird.fr/diade/gratools.git\

Usage: gratools get_subgraph [OPTIONS]
Aliases: subgraph

  This command extracts a specific region from the GFA, defined by a query
  sample, chromosome, and start/end coordinates. The extracted subgraph,
  containing all paths traversing this region (for the query sample and any
  other specified samples), is saved as a new GFA file. Optionally, a
  corresponding FASTA file of the sequences in the subgraph can be generated.
  This command relies on a pre-existing GraTools import of the input GFA.
  
  For more details, see the full documentation:
  https://gratools.readthedocs.io/en/latest/commands/get_subgraph.html

Extraction Query Options:
  -g, --gfa PATH
     Path to the input GFA file (e.g., myGraph.gfa or myGraph.gfa.gz).
     [required]

  -o, --outdir DIRECTORY
     Output directory for GraTools results. If not specified, results are
     typically placed in a subdirectory within the GFA file's parent directory
     (e.g., 'GraTools-output_<gfa_name>').

  -su, --suffix TEXT
     Custom suffix to append to output filenames. If not provided, a default
     suffix will be generated based on the command line parameters.

  -sq, --sample-query TEXT
     Name of the primary query sample to define the region.  [required]

  -chr, --chrom-query TEXT
     Name of the chromosome for the query region.  [required]

  -s, --start-query INTEGER RANGE
     Start position of the query region on the chromosome (0-based).  [default:
     0; x>=0]

  -e, --stop-query INTEGER RANGE
     Stop position of the query region on the chromosome (exclusive). Defaults
     to chromosome end if not provided.  [x>=1]

  -d, --merge-dist INTEGER RANGE
     Merge distance for 'bedtools merge -d'. If -1, uses 10% of query region
     length. 0 for abutting. See bedtools merge docs.  [default: -1; x>=-1]

  -sl, --samples-list FILE
     Path to a file listing additional sample names (one per line) to include in
     the extraction. Mutually exclusive with --all-samples.

  -as, --all-samples
     Include all samples from the GFA in the extraction (relative to the query
     region). Mutually exclusive with --samples-list.

Subgraph Specific Output Options:
  --build-fasta / --no-build-fasta
     Generate a FASTA file from the sequences within the extracted subgraph.
     [default: no-build-fasta]

Logging Options:
  -vv, --verbosity [DEBUG|INFO|ERROR]
     Set the logging verbosity level.  [default: INFO]

  -l, --log-path DIRECTORY
     Directory where the log files will be saved. If not specified, logs will be
     placed in the main output directory (or in a default GraTools log
     location).

Performance Options:
  -t, --threads INTEGER
     Number of threads to be used for parallelizable operations.  [default: 1]

Other options:
  -h, --help
     Show this message and exit.

Constraints:
  {--samples-list, --all-samples}
     The --samples-list and --all-samples options are mutually exclusive.If
     neither is provided, sequences might be extracted only for the query
     sample.

Usage Examples

1. Extract a Specific Region (All Samples)

This example extracts a subgraph corresponding to the region on CG14_Chr07 from 100,000 to 150,000, including the paths for all samples (--all-samples) that traverse this region.

$ gratools get_subgraph --gfa Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 100000 --stop-query 150000 \
    --all-samples --threads 4

 ╭──────────────────────────────────────── Global Tracker ──────────────────────────────────────╮
 │ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:04 0:00:00 │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────────── Samples Tracker ─────────────────────────────────────────────╮
 │ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
 01-13 14:25 |  INFO     | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-100000-150000.gfa.gz
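A quick way to sanity-check the generated file is to tally its GFA record types (S, L, W, and so on). The sketch below is a minimal Python helper, not part of GraTools; the function name `count_gfa_records` is hypothetical, and it only assumes the output is a tab-separated GFA, optionally gzip-compressed.

```python
import gzip
from collections import Counter


def count_gfa_records(path):
    """Tally GFA lines by record type (the first tab-separated field)."""
    # Assumption: a ".gz" extension means the file is gzip-compressed.
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line:
                counts[line.split("\t", 1)[0]] += 1
    return counts
```

Calling `count_gfa_records(...)` on the path of the generated `.gfa.gz` file (inside your output directory) returns a `Counter` keyed by record type, which makes it easy to compare two extractions of the same region.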

2. Subgraph + FASTA Generation

In addition to the GFA file, this command will also generate a FASTA file containing the sequences of all segments present in the extracted subgraph, using the --build-fasta flag.

$ gratools get_subgraph --gfa Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 0 --stop-query 50000 \
    --build-fasta --all-samples

 ╭──────────────────────────────────── Global Tracker ──────────────────────────────────────────╮
 │ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:17 0:00:00 │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
 ╭───────────────────────────────────── Samples Tracker ────────────────────────────────────────────────╮
 │ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:08 0:00:00  │
 │ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:12 0:00:00  │
 │ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:17 0:00:00  │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
 01-13 14:55 |  INFO     | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz
 01-13 14:55 |  INFO     | Generated FASTA file: 'Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.fasta'

⚠️ Note on FASTA Output

The output FASTA contains sequences from all samples present in the subgraph. For more specific FASTA extraction, consider using the gratools get_fasta subcommand.

3. Extract for a Specific List of Samples

This example extracts a subgraph for the specified region but only includes the paths for the samples listed in list_sample2extract.txt.

$ gratools get_subgraph --gfa Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 0 --stop-query 50000 \
    --samples-list list_sample2extract.txt

 ╭─────────────────────────── Global Tracker ───────────────────────────────────────────────────╮
 │ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 2/2 samples 0:00:04 0:00:00 │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
 ╭─────────────────────────── Samples Tracker ─────────────────────────────────────────────────────────╮
 │ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
 │ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
 ╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
 01-13 15:12 |  INFO     | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz

🛑 Filename Warning

The output filename is based on the query region, not the list of samples. Be aware that running this command with different sample lists but the same query region will overwrite previous results unless you specify a different output directory or suffix.
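One way to avoid accidental overwrites is to derive the `--suffix` value from the sample list file itself. The sketch below only builds the argument list; `subgraph_command` is a hypothetical helper, and it uses only options shown in the help output above. How GraTools combines the suffix into the final filename is not shown on this page, so treat the resulting name as something to verify on your own data.

```python
from pathlib import Path


def subgraph_command(gfa, sample, chrom, start, stop, samples_list):
    """Build a get_subgraph argument list with a suffix taken from the sample list name."""
    # Assumption: the list file's base name is a meaningful, unique label.
    suffix = Path(samples_list).stem
    return [
        "gratools", "get_subgraph",
        "--gfa", str(gfa),
        "--sample-query", sample,
        "--chrom-query", chrom,
        "--start-query", str(start),
        "--stop-query", str(stop),
        "--samples-list", str(samples_list),
        "--suffix", suffix,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Two runs over the same region with `list_A.txt` and `list_B.txt` then carry different suffixes.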

Illustrated Process

Define a Region: The --sample-query, --chrom-query, --start-query, and --stop-query options define a genomic interval using a specific sample’s coordinate system.

Identify Paths: The tool identifies all paths for the selected samples (query sample, all samples, or a list) that pass through this genomic interval. The blue path in the graph below represents the traversal for the query sample.

Fig 1: Identification of the query path (Blue).

Collect Elements: Every segment and link belonging to these walks is collected.

Fig 2: Extracted subgraph.

Write GFA: A new valid GFA is produced.
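The "Collect Elements" step can be pictured with a few lines of Python. This is a minimal sketch of the idea, not GraTools' implementation: `parse_walk` and `collect_elements` are hypothetical names, and a link is taken to be any pair of consecutive oriented steps in a walk.

```python
import re


def parse_walk(walk):
    """Split a GFA walk string such as '>10>11<12' into (orientation, segment) steps."""
    return re.findall(r"([><])([^><]+)", walk)


def collect_elements(walks):
    """Return the set of segments and the set of oriented links used by the walks."""
    segments, links = set(), set()
    for walk in walks:
        steps = parse_walk(walk)
        segments.update(seg for _, seg in steps)
        # Each consecutive pair of steps corresponds to one link.
        links.update(zip(steps, steps[1:]))
    return segments, links


segments, links = collect_elements([">10>11>12", ">11>12>13"])
print(sorted(segments, key=int))  # ['10', '11', '12', '13']
print(len(links))                 # 3
```

Because both results are sets, a segment or link shared by several walks is written only once.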

The `--merge-dist` Parameter

This option controls how GraTools handles fragmented assemblies or repetitive regions by merging nearby walk intervals.

📊 Input Graph Structure

Initial GFA

This is the reference graph used in all examples below:

S     1       ACGTA
S     2       CGTAC
S     3       GTACG
S     4       TACGT
S     5       AAAAA
S     6       CCCCC
S     7       GGGGG
S     8       TTTTT
S     9       AACCG
S     10      GTTAA
S     11      CCGGT
S     12      TAACC
S     13      GGTTA
S     14      ACCGG
S     15      TTAAC
S     16      CGGTT
S     17      AACGT
S     18      ACCGT
S     19      ACGGT
S     20      ACGTT
S     21      TGCAT
S     22      GCATG
S     23      CATGC
S     24      ATGCA
S     25      AACAA
S     26      AAGAA
S     27      AATAA
S     28      CCACC
S     29      CCTCC
S     30      CCGCC
S     31      CTCTC
S     32      CTCTC
S     33      CTCTC
S     34      CTCTC
S     35      CTCTC
S     36      CTCTCCTCTC...

Walks across 4 genomes:

W     genomeA 0       genomeA_chr1    0       150     >1>2>3>4>5>6>7>8>9>10>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
W     genomeB 0       genomeB_chr1    0       175     >1>2>3>4>5>6>7>8>9>11>12>13>14>15>30>31>32>33>34>35>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
W     genomeC 0       genomeC_chr1    0       280     >1>2>3>4>5>6>7>8>9>10>11>12>13>14>15>30>36>31>32>33>34>35>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
W     genomeD 0       genomeD_chr1    0       145     >1>2>3>4>5>6>7>8>9>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30

Full input graph structure. Query region (45–100 bp) marked in red.
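To see which segments the 45–100 query covers on genomeA, you can accumulate segment lengths along the walk. The sketch below assumes forward-only walks and 0-based, half-open coordinates (as in the option descriptions above); the helper names are hypothetical. It is restricted to segments 1–30, which are all 5 bp long in the listing.

```python
def segment_intervals(walk, seg_len):
    """Yield (segment, start, end) along a forward-only walk; 0-based, half-open."""
    pos = 0
    for seg in walk.strip(">").split(">"):
        end = pos + seg_len[seg]
        yield seg, pos, end
        pos = end


def segments_in_region(walk, seg_len, q_start, q_stop):
    """Return the segments whose interval overlaps [q_start, q_stop)."""
    return [seg for seg, start, end in segment_intervals(walk, seg_len)
            if start < q_stop and end > q_start]


# Segments 1-30 of the example graph are 5 bp each; genomeA walks them in order.
seg_len = {str(i): 5 for i in range(1, 31)}
genome_a = "".join(f">{i}" for i in range(1, 31))

core = segments_in_region(genome_a, seg_len, 45, 100)
print(core[0], core[-1], len(core))  # 10 20 11
```

Segment 9 ends at position 45 and segment 21 starts at position 100, so neither overlaps the half-open query interval.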

—

Overview of merge-dist settings

🧊 No Merge (dist = 0)

Only intervals that overlap or are exactly adjacent (book-ended) are merged. Walks separated by even a small distance will not be combined.

Result: Fragmented W-lines remain separate. If a sample has walks covering your region but split by gaps, only pieces directly overlapping the query are extracted.

Use case: When you need strict coordinate-based extraction, or when your assembly is highly contiguous.

🔗 Default (dist = 100% query length)

GraTools automatically calculates the merge distance as 100% of the query sequence length. This is the recommended setting for balanced, predictable results.

For a 55bp query (45–100): merge distance = 55bp

Result: Gaps proportional to the size of your query are bridged, ensuring more complete subgraphs that capture biologically related fragments without aggressive over-pulling.

Use case: Typical pangenome analysis with expected fragmentation; predictable scaling across different query sizes.

🤖 Manual (dist = N bp)

You can set a specific distance in base pairs using an integer value.

Result: Full control for highly fragmented assemblies. A large value can be used to link distant fragments of the same chromosome.

Use case: When you know the expected assembly gap size; highly fragmented genomes; custom filtering requirements.
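The merging rule itself is simple to sketch. The function below mimics the documented behaviour of `bedtools merge -d` on one sample's intervals: sorted intervals are merged when the gap between them is at most the given distance, and a distance of 0 merges only overlapping or book-ended intervals. It is an illustration in plain Python, with a hypothetical function name, not the code GraTools runs.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge half-open (start, end) intervals separated by at most max_gap bases."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


fragments = [(45, 70), (100, 125)]      # two illustrative fragments, 30 bp apart
print(merge_intervals(fragments, 0))    # [(45, 70), (100, 125)]
print(merge_intervals(fragments, 55))   # [(45, 125)]
```

With a distance of 0 the two fragments stay separate; any distance of 30 or more joins them into a single interval.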

—

Concrete Example: Query Region 45–100 (55bp span)

🧊 Case 1: No Merging (-d 0)

Parameter: -d 0 (no merge)

Command:

$ gratools get_subgraph -g all_test.gfa \
    -sq genomeA -chr genomeA_chr1 \
    --start-query 45 --stop-query 100 \
    -d 0

Logic: Only walks overlapping the exact query coordinates are extracted. Nearby fragments are ignored, creating fragmented results.

Graph visualization:

With -d 0: Only core region extracted; gaps block fragment merging.

Extracted segments: 10–20 (core only)

Resulting walks:

W     genomeA 0       genomeA_chr1    45      100     >10>11>12>13>14>15>16>17>18>19>20
W     genomeB 0       genomeB_chr1    45      70      >11>12>13>14>15
W     genomeB 0       genomeB_chr1    100     125     >16>17>18>19>20
W     genomeC 0       genomeC_chr1    45      75      >10>11>12>13>14>15
W     genomeC 0       genomeC_chr1    205     230     >16>17>18>19>20
W     genomeD 0       genomeD_chr1    45      95      >11>12>13>14>15>16>17>18>19>20

⚠️ Problem:

genomeB is split into 2 separate walks (45–70 and 100–125)
genomeC is split into 2 separate walks (45–75 and 205–230)
The subgraph is fragmented, losing biological continuity
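The split in genomeB can be reproduced by locating the core segments (10–20) along genomeB's own walk. The sketch below is a simplified illustration under two assumptions: walks are forward-only, and every segment involved is 5 bp long (true for segments 1–35 in the listing). `project_segments` is a hypothetical helper, not a GraTools function.

```python
def project_segments(walk, seg_len, wanted):
    """Return contiguous runs (start, end, segments) of wanted segments along a walk."""
    runs, pos = [], 0
    for seg in walk.strip(">").split(">"):
        end = pos + seg_len[seg]
        if seg in wanted:
            if runs and runs[-1][1] == pos:
                # Directly follows the previous wanted segment: extend the run.
                runs[-1][1] = end
                runs[-1][2].append(seg)
            else:
                runs.append([pos, end, [seg]])
        pos = end
    return [(start, stop, segs) for start, stop, segs in runs]


seg_len = {str(i): 5 for i in range(1, 36)}
order_b = (list(range(1, 10)) + list(range(11, 16))
           + list(range(30, 36)) + list(range(16, 31)))
genome_b = "".join(f">{i}" for i in order_b)
core = {str(i) for i in range(10, 21)}

for start, stop, segs in project_segments(genome_b, seg_len, core):
    print(start, stop, ">" + ">".join(segs))
# 45 70 >11>12>13>14>15
# 100 125 >16>17>18>19>20
```

Segments 30–35 sit between the two runs on genomeB but are not part of the core, which is why the walk breaks at 70 and resumes at 100.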

🔗 Case 2: Default Merging (-d 55, 100% of query)

Parameter: -d 55 (default: 100% of 55bp query)

Command:

$ gratools get_subgraph -g all_test.gfa \
    -sq genomeA -chr genomeA_chr1 \
    --start-query 45 --stop-query 100
    # Uses default -d 55 (100% of query length)

Logic: Nearby intervals separated by ≤55bp are merged. Intermediate segments are pulled in, creating more complete paths while staying proportional to the query size.

Graph visualization:

With -d 55: Bridges the 30bp gap in genomeB; merges nearby fragments; includes the repetitive region.

Extracted segments: 10–20, 30–35 (core + bridged regions)

Resulting walks:

W     genomeA 0       genomeA_chr1    45      100     >10>11>12>13>14>15>16>17>18>19>20
W     genomeB 0       genomeB_chr1    45      125     >11>12>13>14>15>30>31>32>33>34>35>16>17>18>19>20
W     genomeC 0       genomeC_chr1    45      75      >10>11>12>13>14>15
W     genomeC 0       genomeC_chr1    205     230     >16>17>18>19>20
W     genomeD 0       genomeD_chr1    45      95      >11>12>13>14>15>16>17>18>19>20

✅ Benefit:

genomeB is now merged into a single walk (45–125), giving a biologically continuous path
Repetitive region (nodes 30–35) is included
genomeC remains fragmented (its 130bp gap exceeds 55bp), which is correct
Balanced result: More complete without over-pulling

🌍 Case 3: Aggressive Merging (-d 10000)

Parameter: -d 10000 (manual large value)

Command:

$ gratools get_subgraph -g all_test.gfa \
    -sq genomeA -chr genomeA_chr1 \
    --start-query 45 --stop-query 100 \
    -d 10000

Logic: All walks within 10000bp are merged. This bridges even very distant fragments, useful for highly fragmented assemblies.

Graph visualization:

With -d 10000: Bridges all gaps; pulls in all reachable fragments including long-range repeats.

Extracted segments: 10–20, 30–36 (core + all reachable)

Resulting walks:

W     genomeA 0       genomeA_chr1    45      100     >10>11>12>13>14>15>16>17>18>19>20
W     genomeB 0       genomeB_chr1    45      125     >11>12>13>14>15>30>31>32>33>34>35>16>17>18>19>20
W     genomeC 0       genomeC_chr1    45      230     >10>11>12>13>14>15>30>36>31>32>33>34>35>16>17>18>19>20
W     genomeD 0       genomeD_chr1    45      95      >11>12>13>14>15>16>17>18>19>20

⚠️ Trade-off:

genomeC is now merged (45–230, previously fragmented)
Large tandem repeat (node 36) is included, significantly expanding the subgraph
Biologically relevant for some loci, but may include unrelated structural variants
Better for highly fragmented genomes; larger outputs

—

Observation Summary

💡 Key Findings

Case 1 (-d 0):

Minimal but fragmented
genomeB & genomeC split across multiple walks
Best for: Strict coordinate extraction only

Case 2 (-d 55, default 100%):

RECOMMENDED for most analyses
Balances completeness with specificity
genomeB successfully merged (30bp gap < 55bp threshold)
Automatic scaling: works across different query sizes
genomeC correctly remains fragmented (gap > 55bp)

Case 3 (-d 10000):

Maximum connectivity
All genomes merged (even distant fragments)
Larger subgraphs with long-range repeats
Best for: Highly fragmented assemblies

—

When to Use Each Setting

🧊 No Merge (-d 0)

Use when:

Strict coordinate extraction only
Assembly is well-contiguous
Need to avoid ambiguity about “nearby”
Minimal graph size is critical
Testing or validation purposes

🔗 Default (-d 100%)

Use when:

Moderate fragmentation expected (typical pangenomes)
Want predictable, proportional behavior
Query sizes vary widely
Need biologically coherent subgraphs
No special requirements → use this

🤖 Manual (-d N bp)

Use when:

Know expected assembly gap size
Work with highly fragmented genomes
Enforce stricter filtering (small N)
Enforce aggressive merging (large N)
Have known characteristics of your data

—

Impact on Subgraph Size

Query: 45–100 (55bp span)

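Parameter        Segments  Links  Walks               Complexity
────────────────────────────────────────────────────────────────
-d 0             11        10     6 (split)           ★☆☆
-d 55 (default)  17        17     5 (genomeB merged)  ★★☆
-d 10000         18        19     4 (all merged)      ★★★

Segments and links are counted as the distinct segments and distinct consecutive segment pairs in the walks listed for each case.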

Note: Larger merge distances = larger subgraphs and potentially longer processing times. The default -d 100% strikes a balance.
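Subgraph size can be tallied directly from a set of W-lines. The sketch below is one simple way to do it, assuming forward-only walks and counting a link as any distinct pair of consecutive segments; `subgraph_size` is a hypothetical helper. It is applied to the six walks listed for Case 1.

```python
def subgraph_size(walks):
    """Return (distinct segments, distinct links, number of walks)."""
    segments, links = set(), set()
    for walk in walks:
        steps = walk.strip(">").split(">")
        segments.update(steps)
        links.update(zip(steps, steps[1:]))
    return len(segments), len(links), len(walks)


case_1 = [
    ">10>11>12>13>14>15>16>17>18>19>20",  # genomeA
    ">11>12>13>14>15",                    # genomeB, first fragment
    ">16>17>18>19>20",                    # genomeB, second fragment
    ">10>11>12>13>14>15",                 # genomeC, first fragment
    ">16>17>18>19>20",                    # genomeC, second fragment
    ">11>12>13>14>15>16>17>18>19>20",     # genomeD
]
print(subgraph_size(case_1))  # (11, 10, 6)
```

Running the same function on the walks from Cases 2 and 3 shows how each bridged gap adds segments and links while reducing the number of walks.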

—

📑 Quick Links

Command Import: gratools import
FASTA Specialist: gratools get_fasta
Graph Stats: gratools stats
