gratools extract_subgraph

Extract a valid, smaller GFA subgraph based on a genomic region of interest.

✂️ Subgraph Extraction

The extract_subgraph command allows you to isolate a specific genomic window from a massive pangenome graph. By defining a region on a query sample, GraTools identifies all paths (walks) traversing that area and extracts all relevant segments and links into a new, fully valid GFA file.

Options

🛠️ View Command Line Options

$ gratools extract_subgraph
Welcome to GraTools version: '1.1.0.dev7'
@author: GraTools team's
        ____                 __________               ____          
      6MMMMMb/               MMMMMMMMMM               `MM          
     8P    YM               /   MM     \               MM          
    6M      Y ___  __    ___    MM   _____     _____   MM   ____   
    MM        `MM 6MM  6MMMMb   MM  6MMMMMb   6MMMMMb  MM  6MMMMb\ 
    MM         MM69 " 8M'  `Mb  MM 6M'   `Mb 6M'   `Mb MM MM'    ` 
    MM     ___ MM'        ,oMM  MM MM     MM MM     MM MM YM.      
    MM     `M' MM     ,6MM9'MM  MM MM     MM MM     MM MM  YMMMMb  
    YM      M  MM     MM'   MM  MM MM     MM MM     MM MM      `Mb 
     8b    d9  MM     MM.  ,MM  MM YM.   ,M9 YM.   ,M9 MM L    ,MM 
      YMMMMM9  _MM_   `YMMM9'Yb_MM_ YMMMMM9   YMMMMM9 _MM_MYMMMM9 
        \                                    /                /
        /''A''\          /''''''\           /     /''''A'''''\
  ...GC|       |..ATG...C...CG...T....TAG..'..GC.|            |...
        \..C../      \.............../            \...TATA.../
 
Please cite our gitlab: https://forge.ird.fr/diade/gratools.git\

Usage: gratools extract_subgraph [OPTIONS]

  This command extracts a specific region from the GFA, defined by a query
  sample, chromosome, and start/end coordinates. The extracted subgraph,
  containing all paths traversing this region (for the query sample and any
  other specified samples), is saved as a new GFA file. Optionally, a
  corresponding FASTA file of the sequences in the subgraph can be generated.
  This command relies on a pre-existing GraTools index of the input GFA.
  
  For more details, see the full documentation:
  https://gratools.readthedocs.io/en/latest/commands/extract_subgraph.html

Extraction Query Options:
  -g, --gfa PATH
     Path to the input GFA file (e.g., myGraph.gfa or myGraph.gfa.gz).
     [required]

  -o, --outdir DIRECTORY
     Output directory for GraTools results. If not specified, results are
     typically placed in a subdirectory within the GFA file's parent directory
     (e.g., 'GraTools-output_<gfa_name>').

  -su, --suffix TEXT
     Custom suffix to append to output filenames. If not provided, a default
     suffix will be generated based on the command line parameters.

  -sq, --sample-query TEXT
     Name of the primary query sample to define the region.  [required]

  -chr, --chrom-query TEXT
     Name of the chromosome for the query region.  [required]

  -s, --start-query INTEGER RANGE
     Start position of the query region on the chromosome (0-based).  [default:
     0; x>=0]

  -e, --stop-query INTEGER RANGE
     Stop position of the query region on the chromosome (exclusive). Defaults
     to chromosome end if not provided.  [x>=1]

  -d, --merge-dist INTEGER RANGE
     Merge distance for 'bedtools merge -d'. If -1, uses 10% of query region
     length. 0 for abutting. See bedtools merge docs.  [default: -1; x>=-1]

  -sl, --samples-list FILE
     Path to a file listing additional sample names (one per line) to include in
     the extraction. Mutually exclusive with --all-samples.

  -as, --all-samples
     Include all samples from the GFA in the extraction (relative to the query
     region). Mutually exclusive with --samples-list.

Subgraph Specific Output Options:
  --build-fasta / --no-build-fasta
     Generate a FASTA file from the sequences within the extracted subgraph.
     [default: no-build-fasta]

Logging Options:
  -vv, --verbosity [DEBUG|INFO|ERROR]
     Set the logging verbosity level.  [default: INFO]

  -l, --log-path DIRECTORY
     Directory where the log files will be saved. If not specified, logs will be
     placed in the main output directory (or in a default GraTools log
     location).

Performance Options:
  -t, --threads INTEGER
     Number of threads to be used for parallelizable operations.  [default: 1]

Other options:
  -h, --help
     Show this message and exit.

Constraints:
  {--samples-list, --all-samples}
    The --samples-list and --all-samples options are mutually exclusive.If
    neither is provided, sequences might be extracted only for the query sample.

Usage Examples

1. Extract a Specific Region (All Samples)

This example extracts the subgraph corresponding to the region on CG14_Chr07 from 100,000 to 150,000, including the paths of all samples (--all-samples) that traverse this region.

$ gratools extract_subgraph --gfa Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 100000 --stop-query 150000 \
    --all-samples --threads 4

 ╭──────────────────────────────────────── Global Tracker ──────────────────────────────────────╮
 │ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:04 0:00:00 │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────────── Samples Tracker ─────────────────────────────────────────────╮
 │ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
 01-13 14:25 |  INFO     | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-100000-150000.gfa.gz

2. Subgraph + FASTA Generation

With the --build-fasta flag, this command generates, in addition to the GFA file, a FASTA file containing the sequences of all segments present in the extracted subgraph.

$ gratools extract_subgraph --gfa Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 0 --stop-query 50000 \
    --build-fasta --all-samples

 ╭──────────────────────────────────── Global Tracker ──────────────────────────────────────────╮
 │ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:17 0:00:00 │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
 ╭───────────────────────────────────── Samples Tracker ────────────────────────────────────────────────╮
 │ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:08 0:00:00  │
 │ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:12 0:00:00  │
 │ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:17 0:00:00  │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
 01-13 14:55 |  INFO     | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz
 01-13 14:55 |  INFO     | Generated FASTA file: 'Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.fasta'

⚠️ Note on FASTA Output

The output FASTA contains sequences from all samples present in the subgraph. For more specific FASTA extraction, consider using the gratools get_fasta subcommand.

3. Extract for a Specific List of Samples

This example extracts a subgraph for the specified region but only includes the paths for the samples listed in list_sample2extract.txt.

$ gratools extract_subgraph --gfa Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 0 --stop-query 50000 \
    --samples-list list_sample2extract.txt
 ╭─────────────────────────── Global Tracker ───────────────────────────────────────────────────╮
 │ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 2/2 samples 0:00:04 0:00:00 │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
 ╭─────────────────────────── Samples Tracker ─────────────────────────────────────────────────────────╮
 │ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
 │ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00 │
 ╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
 01-13 15:12 |  INFO     | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz

🛑 Filename Warning

The output filename is based on the query region, not the list of samples. Be aware that running this command with different sample lists but the same query region will overwrite previous results unless you specify a different output directory or suffix.

Illustrated Process

Define a Region: The --sample-query, --chrom-query, --start-query, and --stop-query options define a genomic interval using a specific sample’s coordinate system.

Identify Paths: The tool identifies all paths for the selected samples (query sample, all samples, or a list) that pass through this genomic interval. The blue path in the graph below represents the traversal for the query sample.

Fig 1: Identification of the query path (Blue).

Collect Elements: Every segment and link belonging to these walks is collected.

Fig 2: Extracted subgraph.

Write GFA: A new valid GFA is produced.
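The "Collect Elements" step above can be sketched in a few lines of Python. This is only an illustration of the idea, not GraTools' actual implementation: the function name `collect_elements` is hypothetical, and walks are assumed to be GFA 1.1 walk strings such as `>10>11<12`.

```python
import re

# One oriented step of a GFA 1.1 walk, e.g. ">10" or "<7".
STEP = re.compile(r"([<>])([^<>]+)")

def collect_elements(walks):
    """Return the segment IDs and oriented links used by a list of walk strings.

    Hypothetical helper for illustration; not part of the GraTools API.
    """
    segments, links = set(), set()
    for walk in walks:
        steps = STEP.findall(walk)            # [(">", "10"), (">", "11"), ...]
        segments.update(seg_id for _, seg_id in steps)
        # Every pair of consecutive steps implies one link between two segments.
        links.update(zip(steps, steps[1:]))
    return segments, links

segments, links = collect_elements([">10>11>12", ">10<13>12"])
print(sorted(segments, key=int))
print(len(links))
```

Note that this sketch treats a link and its reverse complement as two distinct entries; a real implementation would likely normalize them.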

💡 Topology Preservation

GraTools does not cut segments. If a coordinate falls inside a segment, the entire segment is included. This design choice preserves the original segment IDs and graph topology.
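To make the "no cutting" rule concrete, here is a minimal Python sketch. It assumes a path is given as an ordered list of (segment ID, length) pairs and that the query region is 0-based and half-open, as described for --start-query and --stop-query. The function name and the toy data are hypothetical.

```python
def segments_in_region(path, start, stop):
    """Keep every segment that overlaps [start, stop), without trimming it.

    path: list of (segment_id, length) tuples in walk order (illustrative input).
    Returns (segment_id, segment_start, segment_end) for each kept segment.
    """
    kept, offset = [], 0
    for seg_id, length in path:
        seg_start, seg_end = offset, offset + length
        # Half-open overlap test: any shared base keeps the whole segment.
        if seg_start < stop and seg_end > start:
            kept.append((seg_id, seg_start, seg_end))
        offset = seg_end
    return kept

path = [("s1", 40), ("s2", 10), ("s3", 60)]
hits = segments_in_region(path, 45, 100)
print(hits)
print(hits[0][1], hits[-1][2])  # actual extent covered by the kept segments
```

In this toy case the query 45-100 falls inside `s2` and `s3`, so the extracted extent (40-110) is wider than the requested window. Expect the same effect on coordinates in real output.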

The `--merge-dist` Parameter

This option controls how GraTools handles fragmented assemblies or repetitive regions by merging nearby walk intervals.

🧊 No Merge (dist = 0)

Only intervals that overlap or are exactly adjacent (book-ended) are merged. Walks separated by even a small distance will not be combined.

Result: Fragmented W-lines remain separate. If a sample has two pieces of the same locus separated by a small gap, only the piece touching the query region is extracted.

🤖 Default (dist = 10% query length)

GraTools automatically calculates the merge distance as 10% of the query region length (stop minus start). This is the recommended setting for balanced results.

Result: Bridging logic is applied dynamically. Gaps proportional to the size of your query are bridged, ensuring a more continuous and biologically relevant subgraph without pulling in too much unrelated data.

🔗 Manual (dist = N bp)

You can set a specific distance in base pairs using an integer value.

Result: Full control for highly fragmented assemblies. A large value can be used to link distant fragments of the same chromosome, treating them as a single region of interest for extraction.
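The three modes above can be illustrated with a small Python sketch that mimics the `bedtools merge -d` rule: two intervals are merged when the gap between them is at most the merge distance. The function names are hypothetical, and the integer rounding of the 10% default is an assumption, since the help text does not specify it.

```python
def effective_merge_dist(merge_dist, start, stop):
    """Resolve the -1 default to 10% of the query region length.

    Integer division is an assumption made for this sketch.
    """
    if merge_dist == -1:
        return (stop - start) // 10
    return merge_dist

def merge_intervals(intervals, dist):
    """Merge half-open intervals whose gap is <= dist (bedtools merge -d rule)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= dist:
            merged[-1][1] = max(merged[-1][1], end)   # extend the current block
        else:
            merged.append([start, end])               # start a new block
    return [tuple(block) for block in merged]

# Two walk intervals of one sample, separated by a 50 bp gap.
intervals = [(1000, 1400), (1450, 2000)]
auto = effective_merge_dist(-1, 1000, 2000)   # 10% of a 1000 bp query

print(merge_intervals(intervals, 0))      # no merge: the gap is larger than 0
print(merge_intervals(intervals, auto))   # default: the gap fits within 100 bp
print(merge_intervals(intervals, 50))     # manual: gap equal to dist is merged
```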

📖 Concrete Example: Walk Merging in Fragmented Assemblies

Context: We have a GFA in which some samples are fragmented.
Query: genomeA on chr1, from position 45 to 100.
The Challenge: genomeE has a path that covers the region but is split by a 25 bp gap.

—

🧊 Case 1: No Merging (-d 0)

Logic: Only walks overlapping the query are kept. The 25bp gap is treated as a boundary, resulting in two separate path entries for the same locus.

$ gratools extract_subgraph -g all_test.gfa \
    -sq genomeA -chr genomeA_chr1 \
    --start-query 45 --stop-query 100 -d 0

Resulting GFA Content:

H       VN:Z:1.1        RS:Z:genomeA
...
S       10      GTTAA
S       11      CCGGT
...
S       20      ACGTT
...
W       genomeA 0       genomeA_chr1    45      100     >10>11>12>13>14>15>16>17>18>19>20
W       genomeC 0       genomeC_chr1_1  45      100     >10>11>12>13>14>15>16>17>18>19>20
W       genomeC 0       genomeC_chr1_2  0       55      >10>11>12>13>14>15>16>17>18>19>20
W       genomeE 0       genomeE_chr1    125     180     >10>11>12>13>14>15>16>17>18>19>20
W       genomeE 0       genomeE_chr1    45      100     >10>11>12>13>14>15>16>17>18>19>20

🔗 Case 2: With Merging (-d 100)

Logic: Intervals separated by at most 100 bp are merged. The gap is bridged, allowing GraTools to “pull” the intermediate segments into the subgraph.

$ gratools extract_subgraph -g all_test.gfa \
    -sq genomeA -chr genomeA_chr1 \
    --start-query 45 --stop-query 100 -d 100

Resulting GFA Content:

H       VN:Z:1.1        RS:Z:genomeA
...
S       10      GTTAA
...
S       20      ACGTT
S       21      TGCAT
...
S       25      AACAA
S       6       CCCCC
S       7       GGGGG
S       8       TTTTT
S       9       AACCG
...
W       genomeA 0       genomeA_chr1    45      100     >10>11>12>13>14>15>16>17>18>19>20
W       genomeE 0       genomeE_chr1    45      180     >10>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25>10>11>12>13>14>15>16>17>18>19>20

—

💡 Observation Summary

In Case 1, genomeE is split into two walks, and the subgraph contains only nodes 10 through 20.
In Case 2, GraTools bridges the 25 bp gap because it is smaller than the --merge-dist value (100).
The Difference: The merged subgraph is larger and more complete. It captures the flanking segments (21 to 25) and repetitive nodes (6 to 9) that were skipped in the unmerged run.
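One way to verify this kind of difference on your own output is to compare the segment sets of the W-lines from both runs. The sketch below is a minimal Python illustration that uses the genomeE walks shown above; the helper name `walk_segments` is hypothetical, and it assumes standard GFA 1.1 W-lines in which the walk is the seventh field.

```python
import re

def walk_segments(w_line):
    """Return the set of segment IDs visited by one GFA 1.1 W-line."""
    fields = w_line.split()
    if fields[0] != "W" or len(fields) < 7:
        raise ValueError("not a W-line")
    # Field 7 is the walk, a series of oriented steps such as ">10<11".
    return set(re.findall(r"[<>]([^<>]+)", fields[6]))

case1 = "W genomeE 0 genomeE_chr1 45 100 >10>11>12>13>14>15>16>17>18>19>20"
case2 = ("W genomeE 0 genomeE_chr1 45 180 "
         ">10>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25"
         ">10>11>12>13>14>15>16>17>18>19>20")

# Segments that appear only in the merged run.
gained = sorted(walk_segments(case2) - walk_segments(case1), key=int)
print(gained)
```

For these two walks, the segments gained by merging are 21 through 25, matching the flanking segments listed in the summary.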

—

📑 Quick Links

Command Index: gratools index
FASTA Specialist: gratools get_fasta
Graph Stats: gratools stats
