gratools get_fasta
Efficiently extract genomic sequences from pangenome graphs into FASTA format.
The get_fasta command extracts sequences from a GFA graph for a specific genomic region. It uses a query sample as a coordinate reference to find the corresponding paths for any other samples in the pangenome.
Think of it as a specialized version of gratools get_subgraph optimized for direct FASTA output.
Options
View Command Line Options
$ gratools get_fasta
Welcome to GraTools version: '0.1.0.dev134'
@author: GraTools team's
Please cite our gitlab: https://forge.ird.fr/diade/gratools.git\
Usage: gratools get_fasta [OPTIONS]
Aliases: fasta
Extracts sequences corresponding to a defined genomic region (query sample,
chromosome, start/end) for the query sample and any additional specified
samples. The output is a FASTA file containing these sequences. This command
effectively runs a subgraph extraction focused on sequence output. Relies on a
pre-existing GraTools import.
For more details, see the full documentation:
https://gratools.readthedocs.io/en/latest/commands/get_fasta.html
Extraction Query Options:
-g, --gfa PATH
Path to the input GFA file (e.g., myGraph.gfa or myGraph.gfa.gz).
[required]
-o, --outdir DIRECTORY
Output directory for GraTools results. If not specified, results are
typically placed in a subdirectory within the GFA file's parent directory
(e.g., 'GraTools-output_<gfa_name>').
-su, --suffix TEXT
Custom suffix to append to output filenames. If not provided, a default
suffix will be generated based on the command line parameters.
-sq, --sample-query TEXT
Name of the primary query sample to define the region. [required]
-chr, --chrom-query TEXT
Name of the chromosome for the query region. [required]
-s, --start-query INTEGER RANGE
Start position of the query region on the chromosome (0-based). [default:
0; x>=0]
-e, --stop-query INTEGER RANGE
Stop position of the query region on the chromosome (exclusive). Defaults
to chromosome end if not provided. [x>=1]
-d, --merge-dist INTEGER RANGE
Merge distance for 'bedtools merge -d'. If -1, uses 10% of query region
length. 0 for abutting. See bedtools merge docs. [default: -1; x>=-1]
-sl, --samples-list FILE
Path to a file listing additional sample names (one per line) to include in
the extraction. Mutually exclusive with --all-samples.
-as, --all-samples
Include all samples from the GFA in the extraction (relative to the query
region). Mutually exclusive with --samples-list.
Logging Options:
-vv, --verbosity [DEBUG|INFO|ERROR]
Set the logging verbosity level. [default: INFO]
-l, --log-path DIRECTORY
Directory where the log files will be saved. If not specified, logs will be
placed in the main output directory (or in a default GraTools log
location).
Performance Options:
-t, --threads INTEGER
Number of threads to be used for parallelizable operations. [default: 1]
Other options:
-h, --help
Show this message and exit.
Constraints:
{--samples-list, --all-samples}
The --samples-list and --all-samples options are mutually exclusive. If
neither is provided, sequences might be extracted only for the query
sample.
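The help text says that a --merge-dist of -1 resolves to 10% of the query region length. The following is a minimal Python sketch of that rule, not GraTools code: the helper name `effective_merge_dist` is hypothetical, and rounding down is an assumption, since the help text does not say how a fractional value is handled.

```python
def effective_merge_dist(start: int, stop: int, merge_dist: int = -1) -> int:
    """Resolve the merge distance as described in the help text (illustrative only)."""
    if start < 0 or stop <= start:
        raise ValueError("expected 0 <= start < stop")
    if merge_dist < -1:
        raise ValueError("merge_dist must be >= -1")
    # --start-query is 0-based and --stop-query is exclusive,
    # so the region length is a plain difference.
    region_length = stop - start
    if merge_dist == -1:
        # Assumption: integer division (round down); GraTools may round differently.
        return region_length // 10
    return merge_dist


print(effective_merge_dist(10000, 15000))     # 500
print(effective_merge_dist(10000, 15000, 0))  # 0
```

For the 10000-15000 window used in the usage examples, the default would therefore correspond to a merge distance of 500 bases under these assumptions.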
Usage Examples
Extract a 5 kb region for the sample CG14 based on its own coordinates.
$ gratools get_fasta -g Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 \
--start-query 10000 --stop-query 15000
GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1
CG14: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta
Extract the same genomic locus for every sample present in the graph. GraTools will find the paths in all individuals that correspond to the CG14 reference window.
$ gratools get_fasta -g Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 \
--start-query 10000 --stop-query 15000 \
--all-samples --threads 8
GLOBAL: Samples progression: โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 100% 4/4
Og20: End processing โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 100% 5/5
Og103: End processing โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 100% 5/5
Og182: End processing โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 100% 5/5
Tog5681: End processing โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 100% 5/5
Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta
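Once the FASTA file is written, a quick sanity check is to compare the length of each extracted sequence. The sketch below is a generic FASTA reader in plain Python. It makes no assumption about the header text GraTools writes; the records in `demo` are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up records; real headers written by GraTools may look different.
demo = [">sample_A", "ACGT", "AC", ">sample_B", "ACGTT"]
lengths = {name: len(seq) for name, seq in read_fasta(demo)}
print(lengths)  # {'sample_A': 6, 'sample_B': 5}
```

With a real result, pass an open file handle instead of `demo`, for example `open("Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta")`. Length differences between samples are expected wherever the graph contains insertions or deletions in the window.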
Extract sequences only for a subset of samples defined in a text file.
$ gratools get_fasta -g Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 \
--start-query 10000 --stop-query 15000 \
--samples-list list_samples2extract.txt
GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2
Og20: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Og103: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta
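The file passed to --samples-list contains one sample name per line. A minimal Python sketch for writing such a file follows. The helper name `write_samples_list` is hypothetical, and the two sample names are the ones shown in the output above.

```python
from pathlib import Path


def write_samples_list(path, samples):
    """Write one sample name per line, skipping blanks and duplicates."""
    kept = []
    for name in samples:
        name = name.strip()
        if name and name not in kept:
            kept.append(name)
    Path(path).write_text("".join(f"{name}\n" for name in kept), encoding="utf-8")
    return kept


write_samples_list("list_samples2extract.txt", ["Og20", "Og103"])
```

The sample names are assumed to match those stored in the GFA exactly, so copying them from the graph is safer than typing them by hand.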
Illustrated Process
How GraTools navigates from a graph to a FASTA file:
1. Define Coordinates
Linear Reference Coordinate: The --sample-query and --chrom-query options specify the coordinate system. The --start-query and --stop-query options define the region of interest within that coordinate system.
2. Traverse Graph
Pangenome Graph: The graph contains nodes (sequences) and edges (connections). The linear reference guides the traversal of the graph. Different samples might have variations in this region. The blue path in the graph represents the linear reference path used for extraction.
3. Extract FASTA
Extracted Sequence: get_fasta follows the paths in the graph corresponding to the specified region, generating a FASTA sequence for each selected sample. The example below shows the extraction for sample=reference, chromosome=region, start=4, and end=18.
The following diagram illustrates how the linear reference guides sequence extraction from the pangenome graph:
Use get_fasta if you need the actual DNA sequences for alignment, phylogenetics, or primer design.
Use gratools get_subgraph if you need to visualize the graph topology (e.g., with Bandage) or if you want to keep the links between segments.
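On coordinates: the help text describes --start-query as 0-based and --stop-query as exclusive, which is the same convention as a Python slice. The sketch below applies that convention to the start=4, end=18 window from the illustrated process. The sequence `toy` is made up and `extract_window` is a hypothetical helper, not part of GraTools.

```python
def extract_window(sequence: str, start: int, stop: int) -> str:
    """Return sequence[start:stop] using 0-based, stop-exclusive coordinates."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    return sequence[start:stop]


toy = "ACGT" * 6  # 24 made-up bases standing in for the query chromosome
window = extract_window(toy, 4, 18)
print(len(window))  # 14
print(window)       # ACGTACGTACGTAC
```

Under this convention a window of start=4 and end=18 covers 14 bases on the query sample. Sequences extracted for other samples may be longer or shorter, because they follow their own paths through the graph.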
Quick Links
Command Import: gratools import
Full Extraction: gratools get_subgraph
Workflow Guide: Quick Start Guide