gratools get_fasta

The get_fasta command extracts sequences from a GFA graph for a specific genomic region and saves them in FASTA format. The region is defined using the coordinates of a single query sample, but the command can extract the corresponding sequences for that sample, a list of other samples, or all samples in the graph.

This command is essentially a convenient wrapper around extract_subgraph that is optimized for generating FASTA output directly.

Options

Usage Examples

Extracting a sequence for a specific sample and region

This example extracts the sequence for the sample CG14 in the region from 10,000 to 15,000 on its chromosome CG14_Chr07.

$ gratools get_fasta -g Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 --start-query 10000 --stop-query 15000
GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1
CG14: End processing         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta

Extracting sequences for all samples included in the graph

This command uses the same region defined by CG14's coordinates but extracts the corresponding sequences for all samples in the graph using the --all-samples flag.

$ gratools get_fasta -g Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 --start-query 10000 --stop-query 15000 \
    --all-samples --threads 8
GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4/4
Og20: End processing         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Og103: End processing        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Og182: End processing        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Tog5681: End processing      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta

Extracting sequences for a list of samples provided in a file

Here, we provide a file listing the names of the samples to extract, one per line, using the --samples-list option.

$ cat list_samples2extract.txt
Og20
Og103

$ gratools get_fasta -g Og_cactus.gfa.gz \
    --sample-query CG14 --chrom-query CG14_Chr07 --start-query 10000 --stop-query 15000 \
    --samples-list list_samples2extract.txt
GLOBAL: Samples progression: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2
Og20: End processing         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Og103: End processing        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 5/5
Saving in Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta
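
Once a run has finished, a quick check of the output is often useful. The following is a minimal sketch of a FASTA reader that reports the length of each record. It makes no assumption about how get_fasta names its records: the headers in the stand-in data are made up for illustration, and `read_fasta` is a hypothetical helper, not part of gratools.

```python
def read_fasta(lines):
    """Parse an iterable of FASTA lines into a {header: sequence} dict."""
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}


# Stand-in content with made-up headers. For a real run, pass an open file
# handle instead, e.g. open("Og_cactus_subgraph_CG14-CG14_Chr07-10000-15000.fasta").
example = """>hypothetical_record_1
ACGT
ACG
>hypothetical_record_2
TTGA
""".splitlines()

lengths = {name: len(seq) for name, seq in read_fasta(example).items()}
print(lengths)  # {'hypothetical_record_1': 7, 'hypothetical_record_2': 4}
```

Comparing the record lengths across samples gives a first hint of insertions or deletions relative to the query region.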

Illustrated Example

Understanding the Process

  1. Linear Reference Coordinate: The --sample-query and --chrom-query options select the coordinate system, that is, which sample and which of its chromosomes the positions refer to. The --start-query and --stop-query options define the region of interest within that coordinate system.

    digraph Alignment {
    rankdir=LR;
    node [shape=box, style=filled, fontname="Courier"];

    subgraph cluster_sequences {
        label="Samples";
        style=rounded;
        color=grey;
        penwidth=0;

        s1 [label="reference:"];
        s2 [label="sample2:"];
        s3 [label="sample3:"];
        s4 [label="sample4:"];
    }

    subgraph cluster_unaligned {
        label="Unaligned Sequences";
        style=dotted;
        color=lightgrey;  // Slightly different color
        penwidth=0;

        u1 [label="ACGGTTAAGGGCGATCGCTCGTTTT"];
        u2 [label="ACGGTTAAGCGATCGCTCGTTTT"];
        u3 [label="ACCGTTACGATCGAACTCG"];
        u4 [label="ACCGTTAAGGGCGATCGAATTTT"];
        u1 -> u2 [style=invis];
        u2 -> u3 [style=invis];
        u3 -> u4 [style=invis];
        {rank=same; u1; u2; u3; u4;}
    }


    subgraph cluster_alignment {
        label="Aligned Sequences";
        style=dotted;
        color=grey;
        penwidth=0;

        a1 [label="ACGGTTAAGGGCGATCG--CTCGTTTT"];
        a2 [label="ACGGTTAAG--CGATCG--CTCGTTTT"];
        a3 [label=<AC<font color="red">C</font>GTTA----CGATCGAACTCG---->];
        a4 [label=<AC<font color="red">C</font>GTTAAGGGCGATCGAA----TTTT>];
        a1 -> a2 [style=invis];
        a2 -> a3 [style=invis];
        a3 -> a4 [style=invis];

        {rank=same; a1;a2;a3;a4;} // Force alignment to same rank
    }

    // Link corresponding sample to unaligned and then aligned sequences
    s1 -> u1 [style=dotted, color=grey];
    s2 -> u2 [style=dotted, color=grey];
    s3 -> u3 [style=dotted, color=grey];
    s4 -> u4 [style=dotted, color=grey];


    u1 -> a1 [style=dotted, color=grey];
    u2 -> a2 [style=dotted, color=grey];
    u3 -> a3 [style=dotted, color=grey];
    u4 -> a4 [style=dotted, color=grey];



}
  2. Pangenome Graph: The graph contains nodes (sequence segments) and edges (connections between them). The linear reference guides the traversal of the graph, and other samples may vary within the region. In the graph below, the blue path is the linear reference path used for extraction; red edges mark branches where other samples diverge from it.

    // Pangenome Representation
digraph PangenomeGraph {
    rankdir=LR; // Set graph direction to left-to-right

    // Define nodes with labels representing the genome structure
    AC [label="AC"];
    C1 [label="C"];
    G1 [label="G"];
    GTTAA [label="GTTAA"];
    G2 [label="G"];
    GG [label="GG"];
    C2 [label="C"];
    GATCG [label="GATCG"];
    CTCG [label="CTCG"];
    AA [label="AA"];
    TTTT [label="TTTT"];

    // Define edges showing relationships or paths between nodes
    AC -> G1 [color="blue"];
    AC -> C1 [color="red"];
    G1 -> GTTAA [color="blue"];
    C1 -> GTTAA [color="red"];
    GTTAA -> G2 [color="blue"];
    GTTAA -> GATCG [color="red"];
    G2 -> C2 [color="red"];
    G2 -> GG [color="blue"];
    GG -> C2 [color="blue"];
    C2 -> GATCG [color="blue"];
    GATCG -> CTCG [color="blue"];
    GATCG -> AA [color="red"];
    AA -> CTCG [color="red"];
    CTCG -> TTTT [color="blue"];
    AA -> TTTT [color="red"];
}
  3. Extracted Sequence: get_fasta follows the paths in the graph corresponding to the specified region, generating a FASTA sequence for each selected sample. The example below shows the extraction for sample=reference, chromosome=region, start=4, and end=18; the filled nodes are the ones covered by that region.

    // Pangenome graph with the extracted nodes filled
digraph PangenomeGraph {
    rankdir=LR; // Set graph direction to left-to-right

    // Define nodes with labels representing the genome structure
    AC [label="AC"];
    C1 [label="C"];
    G1 [label="G"];
    GTTAA [label="GTTAA", style="filled,rounded"];
    G2 [label="G", style="filled,rounded"];
    GG [label="GG", style="filled,rounded"];
    C2 [label="C", style="filled,rounded"];
    GATCG [label="GATCG", style="filled,rounded"];
    CTCG [label="CTCG"];
    AA [label="AA"];
    TTTT [label="TTTT"];

    // Define edges showing relationships or paths between nodes
    AC -> G1 [color="blue"];
    AC -> C1 [color="red"];
    G1 -> GTTAA [color="blue"];
    C1 -> GTTAA [color="red"];
    GTTAA -> G2 [color="blue"];
    GTTAA -> GATCG [color="red"];
    G2 -> C2 [color="red"];
    G2 -> GG [color="blue"];
    GG -> C2 [color="blue"];
    C2 -> GATCG [color="blue"];
    GATCG -> CTCG [color="blue"];
    GATCG -> AA [color="red"];
    AA -> CTCG [color="red"];
    CTCG -> TTTT [color="blue"];
    AA -> TTTT [color="red"];
}
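
The extraction shown in step 3 can be mimicked on the toy graph with a few lines of Python. This is only a sketch of the idea, not a description of gratools internals: the node identifiers are taken from the diagram, `extract_region` is a hypothetical helper, and the coordinate convention (1-based start, exclusive end) is an assumption chosen because it reproduces the filled nodes for start=4 and end=18.

```python
# Toy model of the illustrated graph: node id -> sequence.
NODES = {
    "AC": "AC", "G1": "G", "C1": "C", "GTTAA": "GTTAA", "G2": "G",
    "GG": "GG", "C2": "C", "GATCG": "GATCG", "CTCG": "CTCG",
    "AA": "AA", "TTTT": "TTTT",
}

# Blue path in the diagram: the walk followed by the linear reference.
REFERENCE_PATH = ["AC", "G1", "GTTAA", "G2", "GG", "C2", "GATCG", "CTCG", "TTTT"]


def extract_region(path, nodes, start, stop):
    """Return (node_ids, sequence) for 1-based positions start..stop-1 on path."""
    lo, hi = start - 1, stop - 1  # assumed convention -> 0-based half-open
    picked, pieces, offset = [], [], 0
    for node_id in path:
        seq = nodes[node_id]
        node_lo, node_hi = offset, offset + len(seq)
        # Keep the part of this node that overlaps [lo, hi), if any.
        a, b = max(lo, node_lo), min(hi, node_hi)
        if a < b:
            picked.append(node_id)
            pieces.append(seq[a - node_lo:b - node_lo])
        offset = node_hi
    return picked, "".join(pieces)


picked, sequence = extract_region(REFERENCE_PATH, NODES, start=4, stop=18)
print(picked)    # ['GTTAA', 'G2', 'GG', 'C2', 'GATCG']
print(sequence)  # GTTAAGGGCGATCG
```

Walking the path while tracking a running offset is what turns a linear coordinate into a set of graph nodes; the same nodes can then be looked up on other samples' paths.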

The following diagram shows the sequences extracted for each sample over that region, first unaligned and then aligned:

digraph Alignment {
    rankdir=LR;
    node [shape=box, style=filled, fontname="Courier"];

    subgraph cluster_sequences {
        label="Samples";
        style=rounded;
        color=grey;
        penwidth=0;

        s1 [label="reference:"];
        s2 [label="sample2:"];
        s3 [label="sample3:"];
        s4 [label="sample4:"];
    }

    subgraph cluster_unaligned {
        label="Unaligned Sequences";
        style=dotted;
        color=lightgrey;  // Slightly different color
        penwidth=0;

        u1 [label="GTTAAGGGCGATCG"];
        u2 [label="GTTAAGCGATCG"];
        u3 [label="GTTACGATCG"];
        u4 [label="GTTAAGGGCGATCG"];
        u1 -> u2 [style=invis];
        u2 -> u3 [style=invis];
        u3 -> u4 [style=invis];
        {rank=same; u1; u2; u3; u4;}
    }


    subgraph cluster_alignment {
        label="Aligned Sequences";
        style=dotted;
        color=grey;
        penwidth=0;

        a1 [label="GTTAAGGGCGATCG"];
        a2 [label="GTTAAG--CGATCG"];
        a3 [label="GTTA----CGATCG"];
        a4 [label="GTTAAGGGCGATCG"];
        a1 -> a2 [style=invis];
        a2 -> a3 [style=invis];
        a3 -> a4 [style=invis];

        {rank=same; a1;a2;a3;a4;} // Force alignment to same rank
    }

    // Link corresponding sample to unaligned and then aligned sequences
    s1 -> u1 [style=dotted, color=grey];
    s2 -> u2 [style=dotted, color=grey];
    s3 -> u3 [style=dotted, color=grey];
    s4 -> u4 [style=dotted, color=grey];


    u1 -> a1 [style=dotted, color=grey];
    u2 -> a2 [style=dotted, color=grey];
    u3 -> a3 [style=dotted, color=grey];
    u4 -> a4 [style=dotted, color=grey];



}
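
One plausible way to picture how the region defined on the query is carried over to other samples is sketched below. Everything here is an assumption made for illustration: the sample walks are hypothetical paths through the toy graph, `project_region` and `to_fasta` are not gratools functions, and the FASTA headers are simply the sample names. The sketch keeps, for each sample, the part of its walk that spans the nodes selected on the reference.

```python
NODES = {
    "AC": "AC", "G1": "G", "GTTAA": "GTTAA", "G2": "G", "GG": "GG",
    "C2": "C", "GATCG": "GATCG", "CTCG": "CTCG", "TTTT": "TTTT",
}

# Hypothetical walks through the toy graph; sample2 skips the GG node.
PATHS = {
    "reference": ["AC", "G1", "GTTAA", "G2", "GG", "C2", "GATCG", "CTCG", "TTTT"],
    "sample2": ["AC", "G1", "GTTAA", "G2", "C2", "GATCG", "CTCG", "TTTT"],
}

# Nodes covered by the query region on the reference (the filled nodes).
SELECTED = ["GTTAA", "G2", "GG", "C2", "GATCG"]


def project_region(paths, nodes, selected):
    """For each sample, join the nodes between its first and last selected node."""
    wanted = set(selected)
    regions = {}
    for sample, path in paths.items():
        hits = [i for i, node_id in enumerate(path) if node_id in wanted]
        if not hits:
            continue  # this sample never visits the selected region
        sub_walk = path[hits[0]:hits[-1] + 1]
        regions[sample] = "".join(nodes[node_id] for node_id in sub_walk)
    return regions


def to_fasta(records, width=60):
    """Format {name: sequence} as FASTA text with wrapped sequence lines."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


regions = project_region(PATHS, NODES, SELECTED)
print(to_fasta(regions), end="")
```

With these inputs the reference yields GTTAAGGGCGATCG and sample2 yields GTTAAGCGATCG, matching the first two rows of the unaligned sequences in the diagram above.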