gratools get_fasta
The get_fasta command extracts sequences from a GFA graph for a specific genomic region and saves them in FASTA format. The region is defined using the coordinates of a single query sample, but the command can extract the corresponding sequences for that sample, a list of other samples, or all samples in the graph.
This command is essentially a convenient wrapper around extract_subgraph that is optimized for generating FASTA output directly.
Options
Usage Examples
Extracting a sequence for a specific sample and region
This example extracts the sequence for the sample CG14 in the region from 10,000 to 15,000 on its chromosome CG14_Chr07.
$ gratools get_fasta -g Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 --start-query 10000 --stop-query 15000
GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1
CG14: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta
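Once the command finishes, the saved file can be checked with a few lines of Python. The sketch below is a generic FASTA reader and is not part of gratools: `read_fasta` is a hypothetical helper, and the record names it returns are whatever gratools writes in the header lines, which this page does not specify.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a {header: sequence} dict (hypothetical helper)."""
    records = {}
    header = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}


if __name__ == "__main__":
    # File name taken from the "Saving in ..." log line of the example.
    fasta = Path("Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta")
    if fasta.exists():
        for name, seq in read_fasta(fasta).items():
            print(name, len(seq))
```

Printing the length of each record is a quick way to confirm that the extracted region has a plausible size.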
Extracting sequences for all samples included in the graph
This command uses the same region defined by CG14’s coordinates but extracts the corresponding sequences for all samples in the graph using the --all-samples flag.
$ gratools get_fasta -g Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 --start-query 10000 --stop-query 15000 \
--all-samples --threads 8
GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4/4
Og20: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Og103: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Og182: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Tog5681: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta
Extracting sequences for a list of samples provided in a file
Here, we provide a file containing a list of sample names to extract using the --samples-list option.
$ cat list_samples2extract.txt
Og20
Og103
$ gratools get_fasta -g Og_cactus.gfa.gz \
--sample-query CG14 --chrom-query CG14_Chr07 --start-query 10000 --stop-query 15000 \
--samples-list list_samples2extract.txt
GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2
Og20: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Og103: End processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta
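The same call can be prepared from Python. This is a minimal sketch: `write_sample_list` and `build_get_fasta_cmd` are hypothetical helpers, and only flags that appear in the examples on this page are used. The list file is assumed to hold one sample name per line, as in the `cat` output.

```python
from pathlib import Path


def write_sample_list(path, samples):
    """Write one sample name per line (assumed format of the list file)."""
    Path(path).write_text("".join(f"{name}\n" for name in samples))


def build_get_fasta_cmd(gfa, sample, chrom, start, stop, samples_list):
    """Assemble the argument vector for the get_fasta call shown in the example."""
    return [
        "gratools", "get_fasta", "-g", str(gfa),
        "--sample-query", sample,
        "--chrom-query", chrom,
        "--start-query", str(start),
        "--stop-query", str(stop),
        "--samples-list", str(samples_list),
    ]


if __name__ == "__main__":
    cmd = build_get_fasta_cmd(
        "Og_cactus.gfa.gz", "CG14", "CG14_Chr07", 10000, 15000,
        "list_samples2extract.txt",
    )
    # Only prints the command; pass `cmd` to subprocess.run(cmd, check=True)
    # after calling write_sample_list(...) to actually run it.
    print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues if a sample or file name contains unusual characters.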
Illustrated Example
Understanding the Process
Linear Reference Coordinate: The --sample-query and --chrom-query options specify the coordinate system. The --start-query and --stop-query options define the region of interest within that coordinate system.
![digraph Alignment {
rankdir=LR;
node [shape=box, style=filled, fontname="Courier"];
subgraph cluster_sequences {
label="Samples";
style=rounded;
color=grey;
penwidth=0;
s1 [label="reference:"];
s2 [label="sample2:"];
s3 [label="sample3:"];
s4 [label="sample4:"];
}
subgraph cluster_unaligned {
label="Unaligned Sequences";
style=dotted;
color=lightgrey; // Slightly different color
penwidth=0;
u1 [label="ACGGTTAAGGGCGATCGCTCGTTTT"];
u2 [label="ACGGTTAAGCGATCGCTCGTTTT"];
u3 [label="ACCGTTACGATCGAACTCG"];
u4 [label="ACCGTTAAGGGCGATCGAAATTTT"];
u1 -> u2 [style=invis];
u2 -> u3 [style=invis];
u3 -> u4 [style=invis];
{rank=same; u1; u2; u3; u4;}
}
subgraph cluster_alignment {
label="Aligned Sequences";
style=dotted;
color=grey;
penwidth=0;
a1 [label="ACGGTTAAGGGCGATCG--CTCGTTTT"];
a2 [label="ACGGTTAAG--CGATCG--CTCGTTTT"];
a3 [label=<AC<font color="red">C</font>GTTA----CGATCGAACTCG---->]
a4 [label=<AC<font color="red">C</font>GTTAAGGGCGATCGAA----TTTT>];
a1 -> a2 [style=invis];
a2 -> a3 [style=invis];
a3 -> a4 [style=invis];
{rank=same; a1;a2;a3;a4;} // Force alignment to same rank
}
// Link corresponding sample to unaligned and then aligned sequences
s1 -> u1 [style=dotted, color=grey];
s2 -> u2 [style=dotted, color=grey];
s3 -> u3 [style=dotted, color=grey];
s4 -> u4 [style=dotted, color=grey];
u1 -> a1 [style=dotted, color=grey];
u2 -> a2 [style=dotted, color=grey];
u3 -> a3 [style=dotted, color=grey];
u4 -> a4 [style=dotted, color=grey];
}](../_images/graphviz-187fe3137f027f3c7ce569b5b734907bfdf0d23f.png)
Pangenome Graph: The graph contains nodes (sequences) and edges (connections). The linear reference guides the traversal of the graph. Different samples might have variations in this region. The blue path in the graph represents the linear reference path used for extraction.
![// Pangenome Representation
digraph PangenomeGraph {
rankdir=LR; // Set graph direction to left-to-right
// Define nodes with labels representing the genome structure
AC [label="AC"];
C1 [label="C"];
G1 [label="G"];
GTTAA [label="GTTAA"];
G2 [label="G"];
GG [label="GG"];
C2 [label="C"];
GATCG [label="GATCG"];
CTCG [label="CTCG"];
AA [label="AA"];
TTTT [label="TTTT"];
// Define edges showing relationships or paths between nodes
AC -> G1 [color="blue"];
AC -> C1 [color="red"];
G1 -> GTTAA [color="blue"];
C1 -> GTTAA [color="red"];
GTTAA -> G2 [color="blue"];
GTTAA -> GATCG [color="red"];
G2 -> C2 [color="red"];
G2 -> GG [color="blue"];
GG -> C2 [color="blue"];
C2 -> GATCG [color="blue"];
GATCG -> CTCG [color="blue"];
GATCG -> AA [color="red"];
AA -> CTCG [color="red"];
CTCG -> TTTT [color="blue"];
AA -> TTTT [color="red"];
}](../_images/graphviz-7c8463e66d44fb4d1141ed47ffda10b3c7c2b4bb.png)
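To make the node and path idea concrete, the toy graph in the figure can be written as a small Python structure. This is an illustrative sketch only, not how gratools stores the graph: the node IDs (`n1` to `n11`) are made up, and the path is read off the blue edges of the figure.

```python
# Node labels read from the figure; the IDs are hypothetical.
NODES = {
    "n1": "AC", "n2": "G", "n3": "C", "n4": "GTTAA", "n5": "G", "n6": "GG",
    "n7": "C", "n8": "GATCG", "n9": "CTCG", "n10": "AA", "n11": "TTTT",
}

# Blue edges of the figure, in order: the linear reference path.
REFERENCE_PATH = ["n1", "n2", "n4", "n5", "n6", "n7", "n8", "n9", "n11"]


def spell_path(path, nodes):
    """Concatenate node sequences along a path to recover the linear sequence."""
    return "".join(nodes[node_id] for node_id in path)


print(spell_path(REFERENCE_PATH, NODES))
# ACGGTTAAGGGCGATCGCTCGTTTT
```

The printed string is the unaligned reference sequence from the first figure, which shows that a sample's linear sequence is just its path spelled out node by node.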
Extracted Sequence: get_fasta follows the paths in the graph corresponding to the specified region, generating a FASTA sequence for each selected sample. The example below shows the extraction for sample=reference, chromosome=region, start=4, and end=18.
![// Pangenome Representation
digraph PangenomeGraph {
rankdir=LR; // Set graph direction to left-to-right
// Define nodes with labels representing the genome structure
AC [label="AC"];
C1 [label="C"];
G1 [label="G"];
GTTAA [label="GTTAA", style="filled,rounded"];
G2 [label="G", style="filled,rounded"];
GG [label="GG", style="filled,rounded"];
C2 [label="C", style="filled,rounded"];
GATCG [label="GATCG", style="filled,rounded"];
CTCG [label="CTCG"];
AA [label="AA"];
TTTT [label="TTTT"];
// Define edges showing relationships or paths between nodes
AC -> G1 [color="blue"];
AC -> C1 [color="red"];
G1 -> GTTAA [color="blue"];
C1 -> GTTAA [color="red"];
GTTAA -> G2 [color="blue"];
GTTAA -> GATCG [color="red"];
G2 -> C2 [color="red"];
G2 -> GG [color="blue"];
GG -> C2 [color="blue"];
C2 -> GATCG [color="blue"];
GATCG -> CTCG [color="blue"];
GATCG -> AA [color="red"];
AA -> CTCG [color="red"];
CTCG -> TTTT [color="blue"];
AA -> TTTT [color="red"];
}](../_images/graphviz-5ed56400697f96c58ca0ad62bb043e47c0ce1ed1.png)
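The node selection highlighted in the figure can be reproduced with a short sketch. It makes two assumptions that are not confirmed by this page: the node IDs are invented, and the interval is read as 1-based with an exclusive end, which is the reading that matches the highlighted nodes for start=4 and end=18. The coordinate convention gratools itself uses may differ.

```python
# Node labels and reference path read from the figure; IDs are hypothetical.
NODES = {
    "n1": "AC", "n2": "G", "n4": "GTTAA", "n5": "G", "n6": "GG",
    "n7": "C", "n8": "GATCG", "n9": "CTCG", "n11": "TTTT",
}
REFERENCE_PATH = ["n1", "n2", "n4", "n5", "n6", "n7", "n8", "n9", "n11"]


def nodes_in_interval(path, nodes, start, stop):
    """Return the nodes of `path` overlapping positions start..stop-1 (1-based)."""
    selected = []
    offset = 1  # 1-based position of the first base of the current node
    for node_id in path:
        node_start = offset
        node_end = offset + len(nodes[node_id])  # exclusive
        if node_start < stop and node_end > start:
            selected.append(node_id)
        offset = node_end
    return selected


hits = nodes_in_interval(REFERENCE_PATH, NODES, 4, 18)
print(hits)
# ['n4', 'n5', 'n6', 'n7', 'n8']
print("".join(NODES[n] for n in hits))
# GTTAAGGGCGATCG
```

In this toy case the interval boundaries fall exactly on node boundaries, so whole nodes are returned. An interval that starts or ends inside a node would need trimming of the first and last node, which the sketch leaves out.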
The following diagram illustrates how the linear reference guides sequence extraction from the pangenome graph:
![digraph Alignment {
rankdir=LR;
node [shape=box, style=filled, fontname="Courier"];
subgraph cluster_sequences {
label="Samples";
style=rounded;
color=grey;
penwidth=0;
s1 [label="reference:"];
s2 [label="sample2:"];
s3 [label="sample3:"];
s4 [label="sample4:"];
}
subgraph cluster_unaligned {
label="Unaligned Sequences";
style=dotted;
color=lightgrey; // Slightly different color
penwidth=0;
u1 [label="GTTAAGGGCGATCG"];
u2 [label="GTTAAGCGATCG"];
u3 [label="GTTACGATCG"];
u4 [label="GTTAAGGGCGATCG"];
u1 -> u2 [style=invis];
u2 -> u3 [style=invis];
u3 -> u4 [style=invis];
{rank=same; u1; u2; u3; u4;}
}
subgraph cluster_alignment {
label="Aligned Sequences";
style=dotted;
color=grey;
penwidth=0;
a1 [label="GTTAAGGGCGATCG"];
a2 [label="GTTAAG--CGATCG"];
a3 [label="GTTA----CGATCG"]
a4 [label="GTTAAGGGCGATCG"];
a1 -> a2 [style=invis];
a2 -> a3 [style=invis];
a3 -> a4 [style=invis];
{rank=same; a1;a2;a3;a4;} // Force alignment to same rank
}
// Link corresponding sample to unaligned and then aligned sequences
s1 -> u1 [style=dotted, color=grey];
s2 -> u2 [style=dotted, color=grey];
s3 -> u3 [style=dotted, color=grey];
s4 -> u4 [style=dotted, color=grey];
u1 -> a1 [style=dotted, color=grey];
u2 -> a2 [style=dotted, color=grey];
u3 -> a3 [style=dotted, color=grey];
u4 -> a4 [style=dotted, color=grey];
}](../_images/graphviz-19910d48fa43fbe4252727ae2e8c4a03232c58b5.png)
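One way to picture how a region defined on the reference yields a sequence for each sample is to cut every sample's path between the first and last node of the region. The sketch below does this for the toy graph. It is not gratools' algorithm: the node IDs are invented, the sample paths are read off the figure, and it assumes both boundary nodes occur on every sample's path. sample3 is left out because its path cannot be read unambiguously from the figure.

```python
# Node labels read from the figure; IDs and sample paths are assumptions.
NODES = {
    "n1": "AC", "n2": "G", "n3": "C", "n4": "GTTAA", "n5": "G", "n6": "GG",
    "n7": "C", "n8": "GATCG", "n9": "CTCG", "n10": "AA", "n11": "TTTT",
}
PATHS = {
    "reference": ["n1", "n2", "n4", "n5", "n6", "n7", "n8", "n9", "n11"],
    "sample2": ["n1", "n2", "n4", "n5", "n7", "n8", "n9", "n11"],
    "sample4": ["n1", "n3", "n4", "n5", "n6", "n7", "n8", "n10", "n11"],
}


def extract_between(path, nodes, first, last):
    """Spell the part of `path` from node `first` to node `last`, inclusive."""
    i = path.index(first)
    j = path.index(last, i)
    return "".join(nodes[node_id] for node_id in path[i:j + 1])


# n4 (GTTAA) and n8 (GATCG) bound the region selected on the reference.
for sample, path in PATHS.items():
    print(f">{sample}")
    print(extract_between(path, NODES, "n4", "n8"))
```

For the three samples included, the printed sequences match the unaligned sequences in the last figure: sample2 skips the GG node and is therefore two bases shorter than the reference.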