gratools get_fasta

Efficiently extract genomic sequences from pangenome graphs into FASTA format.

📑 Sequence Extraction

The get_fasta command extracts sequences from a GFA graph for a specific genomic region. It uses a query sample as a coordinate reference to find the corresponding paths for any other samples in the pangenome.

Think of it as a specialized version of gratools get_subgraph optimized for direct FASTA output.

Options

🛠️ View Command Line Options
$ gratools get_fasta
Welcome to GraTools version: '0.1.0.dev134'
@author: GraTools team's
        ____                 __________               ____          
      6MMMMMb/               MMMMMMMMMM               `MM          
     8P    YM               /   MM     \               MM          
    6M      Y ___  __    ___    MM   _____     _____   MM   ____   
    MM        `MM 6MM  6MMMMb   MM  6MMMMMb   6MMMMMb  MM  6MMMMb\ 
    MM         MM69 " 8M'  `Mb  MM 6M'   `Mb 6M'   `Mb MM MM'    ` 
    MM     ___ MM'        ,oMM  MM MM     MM MM     MM MM YM.      
    MM     `M' MM     ,6MM9'MM  MM MM     MM MM     MM MM  YMMMMb  
    YM      M  MM     MM'   MM  MM MM     MM MM     MM MM      `Mb 
     8b    d9  MM     MM.  ,MM  MM YM.   ,M9 YM.   ,M9 MM L    ,MM 
      YMMMMM9  _MM_   `YMMM9'Yb_MM_ YMMMMM9   YMMMMM9 _MM_MYMMMM9 
        \                                    /                /
        /''A''\          /''''''\           /     /''''A'''''\
  ...GC|       |..ATG...C...CG...T....TAG..'..GC.|            |...
        \..C../      \.............../            \...TATA.../
 
Please cite our gitlab: https://forge.ird.fr/diade/gratools.git\

Usage: gratools get_fasta [OPTIONS]
Aliases: fasta

  Extracts sequences corresponding to a defined genomic region (query sample,
  chromosome, start/end) for the query sample and any additional specified
  samples. The output is a FASTA file containing these sequences. This command
  effectively runs a subgraph extraction focused on sequence output. Relies on a
  pre-existing GraTools import.
  
  For more details, see the full documentation:
  https://gratools.readthedocs.io/en/latest/commands/get_fasta.html

Extraction Query Options:
  -g, --gfa PATH
     Path to the input GFA file (e.g., myGraph.gfa or myGraph.gfa.gz).
     [required]

  -o, --outdir DIRECTORY
     Output directory for GraTools results. If not specified, results are
     typically placed in a subdirectory within the GFA file's parent directory
     (e.g., 'GraTools-output_<gfa_name>').

  -su, --suffix TEXT
     Custom suffix to append to output filenames. If not provided, a default
     suffix will be generated based on the command line parameters.

  -sq, --sample-query TEXT
     Name of the primary query sample to define the region.  [required]

  -chr, --chrom-query TEXT
     Name of the chromosome for the query region.  [required]

  -s, --start-query INTEGER RANGE
     Start position of the query region on the chromosome (0-based).  [default:
     0; x>=0]

  -e, --stop-query INTEGER RANGE
     Stop position of the query region on the chromosome (exclusive). Defaults
     to chromosome end if not provided.  [x>=1]

  -d, --merge-dist INTEGER RANGE
     Merge distance for 'bedtools merge -d'. If -1, uses 10% of query region
     length. 0 for abutting. See bedtools merge docs.  [default: -1; x>=-1]

  -sl, --samples-list FILE
     Path to a file listing additional sample names (one per line) to include in
     the extraction. Mutually exclusive with --all-samples.

  -as, --all-samples
     Include all samples from the GFA in the extraction (relative to the query
     region). Mutually exclusive with --samples-list.

Logging Options:
  -vv, --verbosity [DEBUG|INFO|ERROR]
     Set the logging verbosity level.  [default: INFO]

  -l, --log-path DIRECTORY
     Directory where the log files will be saved. If not specified, logs will be
     placed in the main output directory (or in a default GraTools log
     location).

Performance Options:
  -t, --threads INTEGER
     Number of threads to be used for parallelizable operations.  [default: 1]

Other options:
  -h, --help
     Show this message and exit.

Constraints:
  {--samples-list, --all-samples}
     The --samples-list and --all-samples options are mutually exclusive. If
     neither is provided, sequences might be extracted only for the query
     sample.
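
The coordinate options read like a half-open interval: --start-query is 0-based and --stop-query is exclusive, so the region length is stop minus start, the same convention as a Python slice. The sketch below illustrates that arithmetic together with the -1 default of --merge-dist. The helper name resolve_region is hypothetical and not part of GraTools, the 1 Mb chromosome length is a placeholder, and rounding the 10% value down to an integer is an assumption, since the help text does not say how the value is rounded.

```python
def resolve_region(start, stop, chrom_length, merge_dist=-1):
    """Return (start, stop, length, merge_dist) for a query region.

    Hypothetical helper mirroring the documented defaults; it is not
    part of GraTools.
    """
    if stop is None:
        stop = chrom_length  # --stop-query defaults to the chromosome end
    if not 0 <= start < stop <= chrom_length:
        raise ValueError("expected 0 <= start < stop <= chromosome length")
    length = stop - start  # half-open interval [start, stop)
    if merge_dist == -1:
        merge_dist = length // 10  # assumed: 10% of the region, rounded down
    return start, stop, length, merge_dist


# Region used in the usage examples, on a chromosome assumed to be 1 Mb long
# (the real length of CG14_Chr07 is not given on this page).
print(resolve_region(10000, 15000, 1_000_000))  # (10000, 15000, 5000, 500)

# The same convention as a Python slice on a toy sequence.
toy = "ACGTACGTAC"
print(toy[2:6])  # GTAC
```

With these conventions, the 10000 to 15000 window of the examples covers 5000 bases, and the automatic merge distance would be 500 under the rounding assumption above.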

Usage Examples

1. Extract Sequence for a Single Sample

Extract a 5 kb region for the sample CG14 based on its own coordinates.

$ gratools get_fasta -g Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 10000 --stop-query 15000
 GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1
 CG14: End processing         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
 Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta

2. Comparative Extraction (All Samples)

Extract the same genomic locus for every sample present in the graph. GraTools will find the paths in all individuals that correspond to the CG14 reference window.

$ gratools get_fasta -g Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 10000 --stop-query 15000 \
    --all-samples --threads 8
 GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4/4
 Og20: End processing         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
 Og103: End processing        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
 Og182: End processing        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
 Tog5681: End processing      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
 Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta

3. Extraction for a Custom List

Extract sequences only for a subset of samples defined in a text file.

$ gratools get_fasta -g Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 \
    --start-query 10000 --stop-query 15000 \
    --samples-list list_samples2extract.txt
 GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2
 Og20: End processing         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
 Og103: End processing        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
 Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta
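
For the custom-list case, the file passed to --samples-list holds one sample name per line. The sketch below writes such a file and assembles the matching command as an argument list, using only options that appear in the help text above. The sample names Og20 and Og103 are taken from the example output, the helper name build_get_fasta_command is hypothetical, and the command is printed rather than executed.

```python
import shlex
import tempfile
from pathlib import Path


def build_get_fasta_command(gfa, sample, chrom, start, stop, samples_file=None):
    """Assemble a gratools get_fasta call as an argument list (hypothetical helper)."""
    cmd = [
        "gratools", "get_fasta",
        "--gfa", str(gfa),
        "--sample-query", sample,
        "--chrom-query", chrom,
        "--start-query", str(start),
        "--stop-query", str(stop),
    ]
    if samples_file is not None:
        cmd += ["--samples-list", str(samples_file)]
    return cmd


# Write the sample list: one name per line.
workdir = Path(tempfile.mkdtemp())
samples_file = workdir / "list_samples2extract.txt"
samples_file.write_text("Og20\nOg103\n")

cmd = build_get_fasta_command(
    "Og_cactus.gfa.gz", "CG14", "CG14_Chr07", 10000, 15000, samples_file
)
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems if a sample or file name contains spaces.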

Illustrated Process

How GraTools navigates from a Graph to a FASTA file:

๐Ÿ“ 1. Define Coordinates

Linear Reference Coordinate: The --sample-query and --chrom-query options specify the coordinate system. The --start-query and --stop-query options define the region of interest within that coordinate.

digraph Alignment {
    rankdir=LR;
    node [shape=box, style=filled, fontname="Courier"];

    subgraph cluster_sequences {
        label="Samples";
        style=rounded;
        color=grey;
        penwidth=0;
        s1 [label="reference:"];
        s2 [label="sample2:"];
        s3 [label="sample3:"];
        s4 [label="sample4:"];
    }

    subgraph cluster_unaligned {
        label="Unaligned Sequences";
        style=dotted;
        color=lightgrey; // Slightly different color
        penwidth=0;
        u1 [label="ACGGTTAAGGGCGATCGCTCGTTTT"];
        u2 [label="ACGGTTAAGCGATCGCTCGTTTT"];
        u3 [label="ACCGTTACGATCGAACTCG"];
        u4 [label="ACCGTTAAGGGCGATCGAAATTTT"];
        u1 -> u2 [style=invis];
        u2 -> u3 [style=invis];
        u3 -> u4 [style=invis];
        {rank=same; u1; u2; u3; u4;}
    }

    subgraph cluster_alignment {
        label="Aligned Sequences";
        style=dotted;
        color=grey;
        penwidth=0;
        a1 [label="ACGGTTAAGGGCGATCG--CTCGTTTT"];
        a2 [label="ACGGTTAAG--CGATCG--CTCGTTTT"];
        a3 [label=<AC<font color="red">C</font>GTTA----CGATCGAACTCG---->];
        a4 [label=<AC<font color="red">C</font>GTTAAGGGCGATCGAA----TTTT>];
        a1 -> a2 [style=invis];
        a2 -> a3 [style=invis];
        a3 -> a4 [style=invis];
        {rank=same; a1; a2; a3; a4;} // Force alignment to same rank
    }

    // Link corresponding sample to unaligned and then aligned sequences
    s1 -> u1 [style=dotted, color=grey];
    s2 -> u2 [style=dotted, color=grey];
    s3 -> u3 [style=dotted, color=grey];
    s4 -> u4 [style=dotted, color=grey];
    u1 -> a1 [style=dotted, color=grey];
    u2 -> a2 [style=dotted, color=grey];
    u3 -> a3 [style=dotted, color=grey];
    u4 -> a4 [style=dotted, color=grey];
}

๐Ÿ•ธ๏ธ 2. Traverse Graph

Pangenome Graph: The graph contains nodes (sequences) and edges (connections). The linear reference guides the traversal of the graph. Different samples might have variations in this region. The blue path in the graph represents the linear reference path used for extraction.

// Pangenome Representation
digraph PangenomeGraph {
    rankdir=LR; // Set graph direction to left-to-right

    // Define nodes with labels representing the genome structure
    AC [label="AC"];
    C1 [label="C"];
    G1 [label="G"];
    GTTAA [label="GTTAA"];
    G2 [label="G"];
    GG [label="GG"];
    C2 [label="C"];
    GATCG [label="GATCG"];
    CTCG [label="CTCG"];
    AA [label="AA"];
    TTTT [label="TTTT"];

    // Define edges showing relationships or paths between nodes
    AC -> G1 [color="blue"];
    AC -> C1 [color="red"];
    G1 -> GTTAA [color="blue"];
    C1 -> GTTAA [color="red"];
    GTTAA -> G2 [color="blue"];
    GTTAA -> GATCG [color="red"];
    G2 -> C2 [color="red"];
    G2 -> GG [color="blue"];
    GG -> C2 [color="blue"];
    C2 -> GATCG [color="blue"];
    GATCG -> CTCG [color="blue"];
    GATCG -> AA [color="red"];
    AA -> CTCG [color="red"];
    CTCG -> TTTT [color="blue"];
    AA -> TTTT [color="red"];
}
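
The traversal step can be pictured as walking an ordered list of nodes and concatenating their sequences. The sketch below does this for the blue reference path of the toy graph above. The node identifiers and sequences come from the figure; the dictionary layout and the function names are assumptions made for illustration and do not describe GraTools internals.

```python
# Node sequences from the toy graph (identifiers as in the figure).
NODES = {
    "AC": "AC", "G1": "G", "C1": "C", "GTTAA": "GTTAA", "G2": "G",
    "GG": "GG", "C2": "C", "GATCG": "GATCG", "CTCG": "CTCG",
    "AA": "AA", "TTTT": "TTTT",
}

# Blue edges of the figure, followed from AC to TTTT.
REFERENCE_PATH = ["AC", "G1", "GTTAA", "G2", "GG", "C2", "GATCG", "CTCG", "TTTT"]


def spell_path(path, nodes):
    """Concatenate node sequences in path order."""
    return "".join(nodes[node_id] for node_id in path)


def node_offsets(path, nodes):
    """Return (node_id, start, stop) per node, 0-based and half-open."""
    offsets, position = [], 0
    for node_id in path:
        length = len(nodes[node_id])
        offsets.append((node_id, position, position + length))
        position += length
    return offsets


reference = spell_path(REFERENCE_PATH, NODES)
print(reference)  # ACGGTTAAGGGCGATCGCTCGTTTT
for node_id, start, stop in node_offsets(REFERENCE_PATH, NODES):
    print(f"{node_id}\t{start}\t{stop}")
```

The spelled path reproduces the unaligned reference sequence from the first diagram, and the per-node offsets show where each node sits on the linear coordinate system.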

🧬 3. Extract FASTA

Extracted Sequence: get_fasta follows the paths in the graph corresponding to the specified region, generating a FASTA sequence for each selected sample. The example below shows the extraction for sample=reference, chromosome=region, start=4, and end=18.

// Pangenome Representation
digraph PangenomeGraph {
    rankdir=LR; // Set graph direction to left-to-right

    // Define nodes with labels representing the genome structure
    AC [label="AC"];
    C1 [label="C"];
    G1 [label="G"];
    GTTAA [label="GTTAA", style="filled,rounded"];
    G2 [label="G", style="filled,rounded"];
    GG [label="GG", style="filled,rounded"];
    C2 [label="C", style="filled,rounded"];
    GATCG [label="GATCG", style="filled,rounded"];
    CTCG [label="CTCG"];
    AA [label="AA"];
    TTTT [label="TTTT"];

    // Define edges showing relationships or paths between nodes
    AC -> G1 [color="blue"];
    AC -> C1 [color="red"];
    G1 -> GTTAA [color="blue"];
    C1 -> GTTAA [color="red"];
    GTTAA -> G2 [color="blue"];
    GTTAA -> GATCG [color="red"];
    G2 -> C2 [color="red"];
    G2 -> GG [color="blue"];
    GG -> C2 [color="blue"];
    C2 -> GATCG [color="blue"];
    GATCG -> CTCG [color="blue"];
    GATCG -> AA [color="red"];
    AA -> CTCG [color="red"];
    CTCG -> TTTT [color="blue"];
    AA -> TTTT [color="red"];
}

The following diagram illustrates how the linear reference guides sequence extraction from the pangenome graph:

digraph Alignment {
    rankdir=LR;
    node [shape=box, style=filled, fontname="Courier"];

    subgraph cluster_sequences {
        label="Samples";
        style=rounded;
        color=grey;
        penwidth=0;
        s1 [label="reference:"];
        s2 [label="sample2:"];
        s3 [label="sample3:"];
        s4 [label="sample4:"];
    }

    subgraph cluster_unaligned {
        label="Unaligned Sequences";
        style=dotted;
        color=lightgrey; // Slightly different color
        penwidth=0;
        u1 [label="GTTAAGGGCGATCG"];
        u2 [label="GTTAAGCGATCG"];
        u3 [label="GTTACGATCG"];
        u4 [label="GTTAAGGGCGATCG"];
        u1 -> u2 [style=invis];
        u2 -> u3 [style=invis];
        u3 -> u4 [style=invis];
        {rank=same; u1; u2; u3; u4;}
    }

    subgraph cluster_alignment {
        label="Aligned Sequences";
        style=dotted;
        color=grey;
        penwidth=0;
        a1 [label="GTTAAGGGCGATCG"];
        a2 [label="GTTAAG--CGATCG"];
        a3 [label="GTTA----CGATCG"];
        a4 [label="GTTAAGGGCGATCG"];
        a1 -> a2 [style=invis];
        a2 -> a3 [style=invis];
        a3 -> a4 [style=invis];
        {rank=same; a1; a2; a3; a4;} // Force alignment to same rank
    }

    // Link corresponding sample to unaligned and then aligned sequences
    s1 -> u1 [style=dotted, color=grey];
    s2 -> u2 [style=dotted, color=grey];
    s3 -> u3 [style=dotted, color=grey];
    s4 -> u4 [style=dotted, color=grey];
    u1 -> a1 [style=dotted, color=grey];
    u2 -> a2 [style=dotted, color=grey];
    u3 -> a3 [style=dotted, color=grey];
    u4 -> a4 [style=dotted, color=grey];
}
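
Once a FASTA file has been written, a few lines of standard-library Python are enough to check what was extracted. The sketch below takes the four toy sequences from the diagram above, writes them in FASTA form, and reads them back to compare lengths. The record names are placeholders: the header format that GraTools actually writes is not shown on this page, so the parser simply keeps the first word after ">".

```python
import io

# Extracted toy sequences from the diagram (sample name -> sequence).
EXTRACTED = {
    "reference": "GTTAAGGGCGATCG",
    "sample2": "GTTAAGCGATCG",
    "sample3": "GTTACGATCG",
    "sample4": "GTTAAGGGCGATCG",
}


def write_fasta(records, handle, width=60):
    """Write name/sequence pairs as FASTA, wrapping lines at `width`."""
    for name, sequence in records.items():
        handle.write(f">{name}\n")
        for i in range(0, len(sequence), width):
            handle.write(sequence[i:i + width] + "\n")


def read_fasta(handle):
    """Parse FASTA text into a dict of name -> sequence."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


buffer = io.StringIO()
write_fasta(EXTRACTED, buffer)
buffer.seek(0)
parsed = read_fasta(buffer)
for name, sequence in parsed.items():
    print(name, len(sequence))
```

In the toy data the lengths differ between samples (14, 12, 10, and 14 bases), which is expected: the same reference window can correspond to shorter or longer paths in other samples.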

💡 Pro Tip: Subgraph vs FASTA
  • Use get_fasta if you need the actual DNA sequences for alignment, phylogenetics, or primer design.

  • Use gratools get_subgraph if you need to visualize the graph topology (e.g., with Bandage) or if you want to keep the links between segments.
