gratools get_subgraph

Extract a valid, smaller GFA subgraph based on a genomic region of interest.

โœ‚๏ธ Subgraph Extraction

The get_subgraph command allows you to isolate a specific genomic window from a massive pangenome graph. By defining a region on a query sample, GraTools identifies all paths (walks) traversing that area and extracts all relevant segments and links into a new, fully valid GFA file.

Options

🛠️ View Command Line Options
$ gratools get_subgraph
Welcome to GraTools version: '0.1.0.dev134'
@author: GraTools team's
        ____                 __________               ____          
      6MMMMMb/               MMMMMMMMMM               `MM          
     8P    YM               /   MM     \               MM          
    6M      Y ___  __    ___    MM   _____     _____   MM   ____   
    MM        `MM 6MM  6MMMMb   MM  6MMMMMb   6MMMMMb  MM  6MMMMb\ 
    MM         MM69 " 8M'  `Mb  MM 6M'   `Mb 6M'   `Mb MM MM'    ` 
    MM     ___ MM'        ,oMM  MM MM     MM MM     MM MM YM.      
    MM     `M' MM     ,6MM9'MM  MM MM     MM MM     MM MM  YMMMMb  
    YM      M  MM     MM'   MM  MM MM     MM MM     MM MM      `Mb 
     8b    d9  MM     MM.  ,MM  MM YM.   ,M9 YM.   ,M9 MM L    ,MM 
      YMMMMM9  _MM_   `YMMM9'Yb_MM_ YMMMMM9   YMMMMM9 _MM_MYMMMM9 
        \                                    /                /
        /''A''\          /''''''\           /     /''''A'''''\
  ...GC|       |..ATG...C...CG...T....TAG..'..GC.|            |...
        \..C../      \.............../            \...TATA.../
 
Please cite our gitlab: https://forge.ird.fr/diade/gratools.git\

Usage: gratools get_subgraph [OPTIONS]
Aliases: subgraph

  This command extracts a specific region from the GFA, defined by a query
  sample, chromosome, and start/end coordinates. The extracted subgraph,
  containing all paths traversing this region (for the query sample and any
  other specified samples), is saved as a new GFA file. Optionally, a
  corresponding FASTA file of the sequences in the subgraph can be generated.
  This command relies on a pre-existing GraTools import of the input GFA.
  
  For more details, see the full documentation:
  https://gratools.readthedocs.io/en/latest/commands/get_subgraph.html

Extraction Query Options:
  -g, --gfa PATH
     Path to the input GFA file (e.g., myGraph.gfa or myGraph.gfa.gz).
     [required]

  -o, --outdir DIRECTORY
     Output directory for GraTools results. If not specified, results are
     typically placed in a subdirectory within the GFA file's parent directory
     (e.g., 'GraTools-output_<gfa_name>').

  -su, --suffix TEXT
     Custom suffix to append to output filenames. If not provided, a default
     suffix will be generated based on the command line parameters.

  -sq, --sample-query TEXT
     Name of the primary query sample to define the region.  [required]

  -chr, --chrom-query TEXT
     Name of the chromosome for the query region.  [required]

  -s, --start-query INTEGER RANGE
     Start position of the query region on the chromosome (0-based).  [default:
     0; x>=0]

  -e, --stop-query INTEGER RANGE
     Stop position of the query region on the chromosome (exclusive). Defaults
     to chromosome end if not provided.  [x>=1]

  -d, --merge-dist INTEGER RANGE
     Merge distance for 'bedtools merge -d'. If -1, uses 10% of query region
     length. 0 for abutting. See bedtools merge docs.  [default: -1; x>=-1]

  -sl, --samples-list FILE
     Path to a file listing additional sample names (one per line) to include in
     the extraction. Mutually exclusive with --all-samples.

  -as, --all-samples
     Include all samples from the GFA in the extraction (relative to the query
     region). Mutually exclusive with --samples-list.

Subgraph Specific Output Options:
  --build-fasta / --no-build-fasta
     Generate a FASTA file from the sequences within the extracted subgraph.
     [default: no-build-fasta]

Logging Options:
  -vv, --verbosity [DEBUG|INFO|ERROR]
     Set the logging verbosity level.  [default: INFO]

  -l, --log-path DIRECTORY
     Directory where the log files will be saved. If not specified, logs will be
     placed in the main output directory (or in a default GraTools log
     location).

Performance Options:
  -t, --threads INTEGER
     Number of threads to be used for parallelizable operations.  [default: 1]

Other options:
  -h, --help
     Show this message and exit.

Constraints:
  {--samples-list, --all-samples}
     The --samples-list and --all-samples options are mutually exclusive.If
     neither is provided, sequences might be extracted only for the query
     sample.

Usage Examples

1. Extract a Specific Region (All Samples)

This example extracts a subgraph corresponding to the region on CG14_Chr07 from 100,000 to 150,000, including the paths for all samples (--all-samples) that traverse this region.

$ gratools get_subgraph --gfa Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 100000 --stop-query 150000 \
    --all-samples --threads 4

 ╭─────────────────────────────────────── Global Tracker ───────────────────────────────────────╮
 │ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:04 0:00:00 │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
 ╭────────────────────────────────────────── Samples Tracker ───────────────────────────────────────────╮
 │ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
 01-13 14:25 |  INFO     | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-100000-150000.gfa.gz
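
The help text describes --start-query as 0-based and --stop-query as exclusive, so the region in this example spans 150000 - 100000 = 50,000 bp. Coordinates taken from a 1-based, inclusive source (a GFF annotation, for instance) need converting first. The sketch below shows that conversion; the helper name is hypothetical and not part of GraTools.

```python
def to_half_open(start_1based: int, end_1based_inclusive: int) -> tuple[int, int]:
    """Convert a 1-based inclusive interval to 0-based, end-exclusive."""
    if start_1based < 1 or end_1based_inclusive < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the inclusive end equals the exclusive stop.
    return start_1based - 1, end_1based_inclusive


# A feature annotated as 100001..150000 (1-based, inclusive) maps to the
# values used with --start-query / --stop-query in the example above.
start, stop = to_half_open(100_001, 150_000)
print(start, stop, stop - start)  # 100000 150000 50000
```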

2. Subgraph + FASTA Generation

With the --build-fasta flag, this command generates, in addition to the GFA file, a FASTA file containing the sequences of all segments present in the extracted subgraph.

$ gratools get_subgraph --gfa Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 0 --stop-query 50000 \
    --build-fasta --all-samples

 ╭─────────────────────────────────────── Global Tracker ───────────────────────────────────────╮
 │ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4/4 samples 0:00:17 0:00:00 │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
 ╭────────────────────────────────────────── Samples Tracker ───────────────────────────────────────────╮
 │ 'Og103': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:08 0:00:00  │
 │ 'Og20': End processing    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:12 0:00:00  │
 │ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:17 0:00:00  │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
 01-13 14:55 |  INFO     | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz
 01-13 14:55 |  INFO     | Generated FASTA file: 'Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.fasta'

⚠️ Note on FASTA Output

The output FASTA contains sequences from all samples present in the subgraph. For more specific FASTA extraction, consider using the gratools get_fasta subcommand.

3. Extract for a Specific List of Samples

This example extracts a subgraph for the specified region but only includes the paths for the samples listed in list_sample2extract.txt.

$ gratools get_subgraph --gfa Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --samples-list list_sample2extract.txt

 ╭─────────────────────────────────────── Global Tracker ───────────────────────────────────────╮
 │ Overall Progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 2/2 samples 0:00:04 0:00:00 │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────╯
 ╭────────────────────────────────────────── Samples Tracker ───────────────────────────────────────────╮
 │ 'Og182': End processing   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 │ 'Tog5681': End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 7/7 steps 0:00:04 0:00:00  │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
 01-13 15:12 |  INFO     | Generated GFA file Og_cactus_subgraph_CG14-CG14_Chr07-0-50000.gfa.gz

🛑 Filename Warning

The output filename is based on the query region, not the list of samples. Be aware that running this command with different sample lists but the same query region will overwrite previous results unless you specify a different output directory or suffix.
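
One way to keep runs with different sample lists apart is to derive the --suffix value from the list file itself. The sketch below writes a samples list (one name per line, as --samples-list expects) and assembles the matching command without running it. The helper names and the file name panel_A.txt are hypothetical, and exactly how GraTools places the suffix in the output filename is not shown in this page.

```python
import tempfile
from pathlib import Path


def write_samples_list(path: Path, names: list[str]) -> Path:
    """Write one sample name per line."""
    path.write_text("\n".join(names) + "\n")
    return path


def build_subgraph_cmd(gfa: str, sample: str, chrom: str, samples_list: Path) -> list[str]:
    """Assemble a get_subgraph command line (returned, not executed)."""
    return [
        "gratools", "get_subgraph",
        "--gfa", gfa,
        "--sample-query", sample,
        "--chrom-query", chrom,
        "--samples-list", str(samples_list),
        # Assumption of this sketch: reuse the list's file stem as the suffix
        # so that each sample list gets its own output name.
        "--suffix", samples_list.stem,
    ]


with tempfile.TemporaryDirectory() as tmp:
    panel = write_samples_list(Path(tmp) / "panel_A.txt", ["Og182", "Tog5681"])
    cmd = build_subgraph_cmd("Og_cactus.gfa.gz", "CG14", "CG14_Chr07", panel)
    print(" ".join(cmd))
```

Passing a distinct --outdir per run would serve the same purpose.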

Illustrated Process

  1. Define a Region: The --sample-query, --chrom-query, --start-query, and --stop-query options define a genomic interval using a specific sample's coordinate system.

  2. Identify Paths: The tool identifies all paths for the selected samples (query sample, all samples, or a list) that pass through this genomic interval. The blue path in the graph below represents the traversal for the query sample.

// Pangenome Representation
digraph PangenomeGraph {
    rankdir=LR; // Set graph direction to left-to-right

    // Define nodes with labels representing the genome structure
    AC [label="AC"];
    C1 [label="C"];
    G1 [label="G"];
    GTTAA [label=<G<b>T</b>TAA>, style="filled,rounded"];
    G2 [label="G", style="filled,rounded"];
    GG [label="GG", style="filled,rounded"];
    C2 [label="C", style="filled,rounded"];
    GATCG [label=<GA<b>T</b>CG>, style="filled,rounded"];
    CTCG [label="CTCG"];
    AA [label="AA"];
    TTTT [label="TTTT"];

    // Define edges showing relationships or paths between nodes
    AC -> G1 [color="blue"];
    AC -> C1 [color="red"];
    G1 -> GTTAA [color="blue"];
    C1 -> GTTAA [color="red"];
    GTTAA -> G2 [color="blue"];
    GTTAA -> GATCG [color="red"];
    G2 -> C2 [color="red"];
    G2 -> GG [color="blue"];
    GG -> C2 [color="blue"];
    C2 -> GATCG [color="blue"];
    GATCG -> CTCG [color="blue"];
    GATCG -> AA [color="red"];
    AA -> CTCG [color="red"];
    CTCG -> TTTT [color="blue"];
    AA -> TTTT [color="red"];
}

Fig 1: Identification of the query path (Blue).

  3. Collect Elements: Every segment and link belonging to these walks is collected.

// Pangenome Representation
digraph PangenomeGraph {
    rankdir=LR; // Set graph direction to left-to-right

    // Define nodes with labels representing the genome structure
    GTTAA [label="GTTAA", style="filled,rounded"];
    G2 [label="G", style="filled,rounded"];
    GG [label="GG", style="filled,rounded"];
    C2 [label="C", style="filled,rounded"];
    GATCG [label="GATCG", style="filled,rounded"];

    // Define edges showing relationships or paths between nodes
    GTTAA -> G2 [color="blue"];
    GTTAA -> GATCG [color="red"];
    G2 -> C2 [color="red"];
    G2 -> GG [color="blue"];
    GG -> C2 [color="blue"];
    C2 -> GATCG [color="blue"];
}

Fig 2: Extracted subgraph.

  4. Write GFA: A new valid GFA is produced.
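
The collection and writing steps can be sketched as follows, using the five nodes of Fig 2. This is an illustration of the idea only, not GraTools' implementation: the two non-query walks are assumptions chosen to reproduce the red edges of the figure, all links are taken in forward orientation, and W-lines are omitted.

```python
# Node names and sequences from Fig 2.
SEQUENCES = {"GTTAA": "GTTAA", "G2": "G", "GG": "GG", "C2": "C", "GATCG": "GATCG"}

WALKS = {
    "query":  ["GTTAA", "G2", "GG", "C2", "GATCG"],  # blue path
    "other1": ["GTTAA", "GATCG"],                    # hypothetical red path
    "other2": ["GTTAA", "G2", "C2", "GATCG"],        # hypothetical red path
}


def collect_subgraph(walks: dict[str, list[str]]) -> tuple[set[str], set[tuple[str, str]]]:
    """Return the segments and directed links used by a set of walks."""
    segments: set[str] = set()
    links: set[tuple[str, str]] = set()
    for path in walks.values():
        segments.update(path)
        links.update(zip(path, path[1:]))  # consecutive segments form a link
    return segments, links


def to_gfa(sequences: dict[str, str], walks: dict[str, list[str]]) -> str:
    """Render header, S-lines and L-lines (GFA1 syntax) for the walks."""
    segments, links = collect_subgraph(walks)
    lines = ["H\tVN:Z:1.0"]
    lines += [f"S\t{name}\t{sequences[name]}" for name in sorted(segments)]
    lines += [f"L\t{a}\t+\t{b}\t+\t0M" for a, b in sorted(links)]
    return "\n".join(lines) + "\n"


segments, links = collect_subgraph(WALKS)
print(len(segments), len(links))  # 5 6
print(to_gfa(SEQUENCES, WALKS), end="")
```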

The --merge-dist Parameter

This option controls how GraTools handles fragmented assemblies or repetitive regions by merging nearby walk intervals.

📊 Input Graph Structure

Initial GFA

This is the reference graph used in all examples below:

S     1       ACGTA
S     2       CGTAC
S     3       GTACG
S     4       TACGT
S     5       AAAAA
S     6       CCCCC
S     7       GGGGG
S     8       TTTTT
S     9       AACCG
S     10      GTTAA
S     11      CCGGT
S     12      TAACC
S     13      GGTTA
S     14      ACCGG
S     15      TTAAC
S     16      CGGTT
S     17      AACGT
S     18      ACCGT
S     19      ACGGT
S     20      ACGTT
S     21      TGCAT
S     22      GCATG
S     23      CATGC
S     24      ATGCA
S     25      AACAA
S     26      AAGAA
S     27      AATAA
S     28      CCACC
S     29      CCTCC
S     30      CCGCC
S     31      CTCTC
S     32      CTCTC
S     33      CTCTC
S     34      CTCTC
S     35      CTCTC
S     36      CTCTCCTCTC...

Walks across 4 genomes:

W     genomeA 0       genomeA_chr1    0       150     >1>2>3>4>5>6>7>8>9>10>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
W     genomeB 0       genomeB_chr1    0       175     >1>2>3>4>5>6>7>8>9>11>12>13>14>15>30>31>32>33>34>35>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
W     genomeC 0       genomeC_chr1    0       280     >1>2>3>4>5>6>7>8>9>10>11>12>13>14>15>30>36>31>32>33>34>35>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
W     genomeD 0       genomeD_chr1    0       145     >1>2>3>4>5>6>7>8>9>11>12>13>14>15>16>17>18>19>20>21>22>23>24>25>26>27>28>29>30
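
To see where the query region falls in each walk, the segment coordinates can be accumulated along the W-lines. The sketch below does this for the four walks above, with the query 45–100 on genomeA. It is an illustration rather than GraTools' own code, and it rests on one assumption: every segment is 5 bp except segment 36, whose sequence is elided in the listing; its length of 100 bp is inferred from genomeC's walk length (280 bp).

```python
import re

SEG_LEN = {str(i): 5 for i in range(1, 36)}
SEG_LEN["36"] = 100  # assumption: inferred from genomeC's total length (280 bp)

HEAD = ">1>2>3>4>5>6>7>8>9"
CORE = ">11>12>13>14>15"
TAIL = ">16>17>18>19>20>21>22>23>24>25>26>27>28>29>30"
WALKS = {
    "genomeA": HEAD + ">10" + CORE + TAIL,
    "genomeB": HEAD + CORE + ">30>31>32>33>34>35" + TAIL,
    "genomeC": HEAD + ">10" + CORE + ">30>36>31>32>33>34>35" + TAIL,
    "genomeD": HEAD + CORE + TAIL,
}


def segment_spans(walk: str) -> list[tuple[str, int, int]]:
    """Return (segment, start, end) per step, 0-based and end-exclusive."""
    spans, pos = [], 0
    for seg in re.findall(r"[<>](\w+)", walk):
        spans.append((seg, pos, pos + SEG_LEN[seg]))
        pos += SEG_LEN[seg]
    return spans


def query_segments(walk: str, start: int, stop: int) -> set[str]:
    """Segments of the query walk that overlap [start, stop)."""
    return {seg for seg, s, e in segment_spans(walk) if s < stop and e > start}


def hits(walk: str, wanted: set[str]) -> list[tuple[int, int]]:
    """Coordinates of each uninterrupted run of wanted segments in a walk."""
    runs: list[tuple[int, int]] = []
    for seg, s, e in segment_spans(walk):
        if seg not in wanted:
            continue
        if runs and runs[-1][1] == s:  # contiguous with the previous hit
            runs[-1] = (runs[-1][0], e)
        else:
            runs.append((s, e))
    return runs


wanted = query_segments(WALKS["genomeA"], 45, 100)  # segments 10 to 20
for sample, walk in WALKS.items():
    print(sample, hits(walk, wanted))
# genomeA [(45, 100)]
# genomeB [(45, 70), (100, 125)]
# genomeC [(45, 75), (205, 230)]
# genomeD [(45, 95)]
```

These intervals are the per-sample pieces that --merge-dist then decides whether to join.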

digraph {
    graph [rankdir=LR, bgcolor="transparent", size="14,6"];
    node [shape=box, style=filled, fontname="Courier", fontsize=9];

    // Core region (query: 45-100)
    subgraph cluster_core {
        label="Core Region (Query: 45–100 bp)";
        color="#e35025";
        style=bold;
        10 [fillcolor="#e35025", fontcolor=white, label="10\nGTTAA"];
        11 [fillcolor="#e35025", fontcolor=white];
        12 [fillcolor="#e35025", fontcolor=white];
        13 [fillcolor="#e35025", fontcolor=white];
        14 [fillcolor="#e35025", fontcolor=white];
        15 [fillcolor="#e35025", fontcolor=white];
        16 [fillcolor="#e35025", fontcolor=white];
        17 [fillcolor="#e35025", fontcolor=white];
        18 [fillcolor="#e35025", fontcolor=white];
        19 [fillcolor="#e35025", fontcolor=white];
        20 [fillcolor="#e35025", fontcolor=white, label="20\nACGTT"];
    }

    // Repetitive region (nearby)
    subgraph cluster_repeat {
        label="Repetitive Region (Nearby: 30–35)";
        color="#f6cc51";
        style=bold;
        30 [fillcolor="#f6cc51", label="30\nCCGCC"];
        31 [fillcolor="#f6cc51"];
        32 [fillcolor="#f6cc51"];
        33 [fillcolor="#f6cc51"];
        34 [fillcolor="#f6cc51"];
        35 [fillcolor="#f6cc51"];
    }

    // Long-range repetitive (distant)
    36 [shape=box, style=filled, fillcolor="#90EE90", label="36\n(Large tandem)"];

    // Connections
    10 -> 11 -> 12 -> 13 -> 14 -> 15 -> 16 -> 17 -> 18 -> 19 -> 20;
    15 -> 30 -> 31 -> 32 -> 33 -> 34 -> 35 -> 16;
    30 -> 36;
    36 -> 31;
}

Full input graph structure. Query region (45–100 bp) marked in red.

---

Overview of merge-dist settings

No merge (-d 0)

Only intervals that overlap or are exactly adjacent (book-ended) are merged. Walks separated by even a small distance are not combined.

Result: Fragmented W-lines remain separate. If a sample has walks covering your region but split by gaps, only the pieces directly overlapping the query are extracted.

Use case: You need strict coordinate-based extraction, or your assembly is highly contiguous.

Default (-d -1)

GraTools automatically sets the merge distance to 100% of the query region length. This is the recommended setting for balanced, predictable results. (The built-in help text shown above quotes 10% of the query region length for -1; pass -d explicitly if the exact value matters.)

For a 55 bp query (45–100): merge distance = 55 bp

Result: Gaps proportional to the size of your query are bridged, giving more complete subgraphs that capture biologically related fragments without aggressive over-pulling.

Use case: Typical pangenome analysis with expected fragmentation; predictable scaling across different query sizes.

Manual (-d N)

You can set a specific distance in base pairs using an integer value.

Result: Full control for highly fragmented assemblies. A large value can be used to link distant fragments of the same chromosome.

Use case: You know the expected assembly gap size, work with highly fragmented genomes, or have custom filtering requirements.
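
The three settings can be compared with a small interval-merging sketch. It follows the semantics of 'bedtools merge -d' (intervals whose gap is at most d are joined; 0 joins only overlapping or book-ended ones) and uses the genomeB and genomeC walk coordinates from the concrete example in this section. The function is illustrative and is not taken from GraTools.

```python
def merge_intervals(intervals: list[tuple[int, int]], max_gap: int = 0) -> list[tuple[int, int]]:
    """Merge intervals separated by at most max_gap bases."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


genome_b = [(45, 70), (100, 125)]
genome_c = [(45, 75), (205, 230)]

for dist in (0, 55, 10000):
    print(dist, merge_intervals(genome_b, dist), merge_intervals(genome_c, dist))
# 0 [(45, 70), (100, 125)] [(45, 75), (205, 230)]
# 55 [(45, 125)] [(45, 75), (205, 230)]
# 10000 [(45, 125)] [(45, 230)]
```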

---

Concrete Example: Query Region 45–100 (55 bp span)

Parameter: -d 0 (no merge)

Command:

$ gratools get_subgraph -g all_test.gfa \
    -sq genomeA -chr genomeA_chr1 \
    --start-query 45 --stop-query 100 \
    -d 0

Logic: Only walks overlapping the exact query coordinates are extracted. Nearby fragments are ignored, creating fragmented results.

Graph visualization:

digraph {
    graph [rankdir=LR, bgcolor="transparent"];
    node [shape=box, style=filled, fontname="Courier", fontsize=10];

    subgraph cluster_query {
        label="Extracted: Query Region Only";
        color="#e35025";
        style=bold;
        Q1 [label="Query\n(45–100)", shape=invhouse, fillcolor="#e35025", fontcolor=white];
        C10 [fillcolor="#e35025", fontcolor=white, label="10–20"];
    }

    subgraph cluster_ignore {
        label="Ignored: Nearby Fragments";
        color=grey;
        style=dotted;
        I30 [fillcolor=lightgrey, label="30–35\n(30bp gap)"];
        I36 [fillcolor=lightgrey, label="36\n(far)"];
    }

    Q1 -> C10;
    C10 -> I30 [style=dotted, color=red, label="Gap ≥ 30bp"];
    I30 -> I36 [style=dotted, color=red];
}

With -d 0: Only core region extracted; gaps block fragment merging.

Extracted segments: 10–20 (core only)

Resulting walks:

W     genomeA 0       genomeA_chr1    45      100     >10>11>12>13>14>15>16>17>18>19>20
W     genomeB 0       genomeB_chr1    45      70      >11>12>13>14>15
W     genomeB 0       genomeB_chr1    100     125     >16>17>18>19>20
W     genomeC 0       genomeC_chr1    45      75      >10>11>12>13>14>15
W     genomeC 0       genomeC_chr1    205     230     >16>17>18>19>20
W     genomeD 0       genomeD_chr1    45      95      >11>12>13>14>15>16>17>18>19>20

⚠️ Problem:

  • genomeB is split into 2 separate walks (45–70 and 100–125)

  • genomeC is split into 2 separate walks (45–75 and 205–230)

  • The subgraph is fragmented, losing biological continuity

Parameter: -d 55 (default: 100% of the 55 bp query)

Command:

$ gratools get_subgraph -g all_test.gfa \
    -sq genomeA -chr genomeA_chr1 \
    --start-query 45 --stop-query 100
    # No -d given: the default (-1) resolves to 55 bp (100% of query length)

Logic: Nearby intervals separated by ≤55bp are merged. Intermediate segments are pulled in, creating more complete paths while staying proportional to the query size.
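
How the default resolves to a concrete distance can be sketched as below. The function names and the fraction argument are hypothetical: this section describes the default as 100% of the query length, while the --help text quotes 10%, so the fraction is left as a parameter rather than asserted. The fragment coordinates are those of the genomeB and genomeC walks in this example.

```python
def resolve_merge_dist(start: int, stop: int, merge_dist: int = -1, fraction: float = 1.0) -> int:
    """Return the effective merge distance in bp.

    merge_dist == -1 means "derive it from the query length"; `fraction`
    exists only in this sketch (1.0 reproduces the 55 bp used here).
    """
    if merge_dist < -1:
        raise ValueError("merge distance must be >= -1")
    if merge_dist == -1:
        return int((stop - start) * fraction)
    return merge_dist


def is_bridged(left_end: int, right_start: int, dist: int) -> bool:
    """True when the gap between two walk fragments is within the merge distance."""
    return right_start - left_end <= dist


dist = resolve_merge_dist(45, 100)
print(dist)                        # 55
print(is_bridged(70, 100, dist))   # genomeB fragments 45-70 and 100-125: True
print(is_bridged(75, 205, dist))   # genomeC fragments 45-75 and 205-230: False
```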

Graph visualization:

digraph {
    graph [rankdir=LR, bgcolor="transparent"];
    node [shape=box, style=filled, fontname="Courier", fontsize=10];

    subgraph cluster_query {
        label="Extracted: Query + Bridged Region";
        color="#e35025";
        style=bold;
        Q1 [label="Query\n(45–100)", shape=invhouse, fillcolor="#e35025", fontcolor=white];
        C10 [fillcolor="#e35025", fontcolor=white, label="10–20"];
    }

    subgraph cluster_bridge {
        label="Bridged: Within 55bp";
        color="#f6cc51";
        style=bold;
        B30 [fillcolor="#f6cc51", label="30–35\n(gap < 55bp)"];
    }

    subgraph cluster_ignore {
        label="Ignored: Beyond 55bp";
        color=grey;
        style=dotted;
        I36 [fillcolor=lightgrey, label="36\n(far)"];
    }

    Q1 -> C10;
    C10 -> B30 [color="#f6cc51", label="30bp gap < 55bp\n→ MERGED"];
    B30 -> I36 [style=dotted, color=red, label="Far away"];
}

With -d 55: Bridges the 30 bp gap; merges nearby fragments; includes repetitive region.

Extracted segments: 10–20, 30–35 (core + bridged regions)

Resulting walks:

W     genomeA 0       genomeA_chr1    45      100     >10>11>12>13>14>15>16>17>18>19>20
W     genomeB 0       genomeB_chr1    45      125     >11>12>13>14>15>30>31>32>33>34>35>16>17>18>19>20
W     genomeC 0       genomeC_chr1    45      75      >10>11>12>13>14>15
W     genomeC 0       genomeC_chr1    205     230     >16>17>18>19>20
W     genomeD 0       genomeD_chr1    45      95      >11>12>13>14>15>16>17>18>19>20

✅ Benefit:

  • genomeB is now merged into a single walk (45–125) with a biologically continuous path

  • Repetitive region (nodes 30–35) is included

  • genomeC remains fragmented (gap > 55bp), which is correct

  • Balanced result: More complete without over-pulling

Parameter: -d 10000 (manual large value)

Command:

$ gratools get_subgraph -g all_test.gfa \
    -sq genomeA -chr genomeA_chr1 \
    --start-query 45 --stop-query 100 \
    -d 10000

Logic: All walks within 10000bp are merged. This bridges even very distant fragments, useful for highly fragmented assemblies.

Graph visualization:

digraph {
    graph [rankdir=LR, bgcolor="transparent"];
    node [shape=box, style=filled, fontname="Courier", fontsize=10];

    subgraph cluster_query {
        label="Extracted: Query Region";
        color="#e35025";
        style=bold;
        Q1 [label="Query\n(45–100)", shape=invhouse, fillcolor="#e35025", fontcolor=white];
        C10 [fillcolor="#e35025", fontcolor=white, label="10–20"];
    }

    subgraph cluster_bridge {
        label="Bridged: All within 10000bp";
        color="#90EE90";
        style=bold;
        B30 [fillcolor="#90EE90", label="30–35"];
        B36 [fillcolor="#90EE90", label="36\n(large tandem)"];
    }

    Q1 -> C10;
    C10 -> B30 [color="#90EE90", label="30bp gap"];
    B30 -> B36 [color="#90EE90", label="Far away < 10000bp\n→ MERGED"];
}

With -d 10000: Bridges all gaps; pulls in all reachable fragments including long-range repeats.

Extracted segments: 10–20, 30–36 (core + all reachable)

Resulting walks:

W     genomeA 0       genomeA_chr1    45      100     >10>11>12>13>14>15>16>17>18>19>20
W     genomeB 0       genomeB_chr1    45      125     >11>12>13>14>15>30>31>32>33>34>35>16>17>18>19>20
W     genomeC 0       genomeC_chr1    45      230     >10>11>12>13>14>15>30>36>31>32>33>34>35>16>17>18>19>20
W     genomeD 0       genomeD_chr1    45      95      >11>12>13>14>15>16>17>18>19>20

⚠️ Trade-off:

  • genomeC is now merged (45–230, previously fragmented)

  • Large tandem repeat (node 36) is included, significantly expanding the subgraph

  • Biologically relevant for some loci, but may include unrelated structural variants

  • Better for highly fragmented genomes; larger outputs

---

Observation Summary

💡 Key Findings

Case 1 (-d 0):

  • Minimal but fragmented

  • genomeB & genomeC split across multiple walks

  • Best for: Strict coordinate extraction only

Case 2 (-d 55, default 100%):

  • RECOMMENDED for most analyses

  • Balances completeness with specificity

  • genomeB successfully merged (30bp gap < 55bp threshold)

  • Automatic scaling: works across different query sizes

  • genomeC correctly remains fragmented (gap > 55bp)

Case 3 (-d 10000):

  • Maximum connectivity

  • All genomes merged (even distant fragments)

  • Larger subgraphs with long-range repeats

  • Best for: Highly fragmented assemblies

---

When to Use Each Setting

🧊 No Merge (-d 0)

Use when:

  • Strict coordinate extraction only

  • Assembly is well-contiguous

  • Need to avoid ambiguity about "nearby"

  • Minimal graph size is critical

  • Testing or validation purposes

🔗 Default (-d -1: 100% of query length)

Use when:

  • Moderate fragmentation expected (typical pangenomes)

  • Want predictable, proportional behavior

  • Query sizes vary widely

  • Need biologically coherent subgraphs

  • No special requirements → use this

🤖 Manual (-d N bp)

Use when:

  • Know expected assembly gap size

  • Work with highly fragmented genomes

  • Enforce stricter filtering (small N)

  • Enforce aggressive merging (large N)

  • Have known characteristics of your data

---

Impact on Subgraph Size

Query: 45–100 (55bp span)

Parameter        Segments  Links  Walks       Complexity
──────────────────────────────────────────────────────────
-d 0             11        10     6 (split)   ★☆☆
-d 55 (default)  17        17     5 (merged)  ★★☆
-d 10000         18        19     4 (merged)  ★★★

Counts are taken from the "Extracted segments" and "Resulting walks" listings of the concrete example; links are the distinct consecutive segment pairs in those walks.

Note: Larger merge distances produce larger subgraphs and potentially longer processing times. The default (-d -1, 100% of the query length) strikes a balance.
