gratools stats
The stats command provides a comprehensive overview of a GFA file’s properties. It parses the graph (or uses a pre-existing GraTools index) to calculate a wide range of metrics related to its size, connectivity, and complexity. This is useful for quality control and for getting a high-level understanding of the pangenome structure.
Options
Usage Examples
1. Display Consolidated Statistics Table (Default)
This command calculates all statistics and displays them in a single, consolidated table.
$ gratools stats -g Og_cactus.gfa.gz
╭─────────────────────────┬─────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────╮
│ Category │ Metric │ Value │
├─────────────────────────┼─────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Graph Overview │ GFA File Name │ Og_cactus │
│ Graph Overview │ GFA Version │ 1.1 │
│ Graph Overview │ Total Segments (S lines) │ 2,354,995 │
│ Graph Overview │ Total Links (L lines) │ 6,670,282 │
│ Graph Overview │ Total Walks (W lines) │ 23 │
│ Graph Overview │ Unique Samples in Walks │ 5 │
│ Segment Statistics │ Total Segment Length (bp) │ 57,119,496 │
│ ... │ ... │ ... │
╰─────────────────────────┴─────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────╯
2. Display Statistics by Category
For better readability, you can display the statistics in separate tables for each category using the --by-category flag.
$ gratools stats -g Og_cactus.gfa.gz --by-category
--- GFA Statistics for: Og_cactus ---
Graph Overview Statistics
╭──────────────────────────┬───────────╮
│ Metric │ Value │
├──────────────────────────┼───────────┤
│ GFA File Name │ Og_cactus │
│ ... │ ... │
╰──────────────────────────┴───────────╯
Segment Statistics Statistics
╭───────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────╮
│ Metric │ Value │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Total Segment Length (bp) │ 57,119,496 │
│ ... │ ... │
╰───────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────╯
...
3. Save Statistics to a File
To save the output in CSV format for later analysis, use the --save flag. The file will be saved within the GraTools index directory.
$ gratools stats -g Og_cactus.gfa.gz --save
Stats Explained
The output is divided into several categories, each providing insight into a different aspect of the graph.
Graph Overview
| Metric | Description |
|---|---|
| GFA File Name | The base name of the input GFA file. |
| GFA Version | The GFA version declared in the file’s header (e.g., 1.1). |
| Total Segments (S lines) | The total count of segment lines (S-lines) in the GFA file. |
| Total Links (L lines) | The total count of link lines (L-lines) in the GFA file. |
| Total Walks (W lines) | The total count of walk lines (W-lines), representing paths for samples. |
| Unique Samples in Walks | The number of distinct sample names found in the walk lines. |
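The overview counts can be reproduced approximately with a few lines of Python. The sketch below is illustrative only and is not the GraTools implementation: the function name `count_gfa_records` and the toy graph `TOY_GFA` are made up for this example, and it assumes standard GFA 1.1 layout (tab-separated fields, the version in a `VN:Z:` header tag, the sample name in the second field of each W-line).

```python
import io
from collections import Counter

def count_gfa_records(handle):
    """Count S, L and W lines and distinct walk sample names in a GFA text stream."""
    counts = Counter()
    samples = set()
    version = None
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        record = fields[0]
        if record == "H":
            # The header carries the version as a VN:Z:<version> tag.
            for tag in fields[1:]:
                if tag.startswith("VN:Z:"):
                    version = tag[len("VN:Z:"):]
        elif record in ("S", "L", "W"):
            counts[record] += 1
            if record == "W" and len(fields) > 1:
                samples.add(fields[1])  # W-line field 2 is the sample name
    return {
        "version": version,
        "segments": counts["S"],
        "links": counts["L"],
        "walks": counts["W"],
        "unique_samples": len(samples),
    }

# Hypothetical toy graph: 2 segments, 1 link, 3 walks from 2 samples.
TOY_GFA = "\n".join([
    "H\tVN:Z:1.1",
    "S\t1\tACGT",
    "S\t2\tGG",
    "L\t1\t+\t2\t+\t0M",
    "W\tsampleA\t0\tchr1\t0\t6\t>1>2",
    "W\tsampleB\t0\tchr1\t0\t4\t>1",
    "W\tsampleA\t1\tchr1\t0\t6\t>1>2",
])

overview = count_gfa_records(io.StringIO(TOY_GFA))
print(overview)
```

For a compressed file such as `Og_cactus.gfa.gz`, the same function should work on a handle opened with `gzip.open(path, "rt")`.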
Segment Statistics
| Metric | Description |
|---|---|
| Total Segment Length (bp) | The sum of the lengths of all unique segments in the graph. Also called Graph Size. |
| Average Segment Length (bp) | The mean length of segments (Total Segment Length / Total Segments). |
| Median Segment Length (bp) | The median length across all segments. |
| Avg Length of Top 5% Longest Segments (bp) | The average length of the 5% of segments that are the longest. |
| Median Length of Top 5% Longest Segments (bp) | The median length of the 5% of segments that are the longest. |
| Segment Length Distribution | The number of segments falling into predefined length bins. |
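As a rough illustration of how these segment metrics relate to each other, the sketch below computes them from a plain list of segment lengths. It is not the GraTools code: the function name `segment_length_stats` is hypothetical, and rounding the size of the top 5% group up to at least one segment is an assumption. The length bins are not reproduced because the source does not state their boundaries.

```python
import math
import statistics

def segment_length_stats(lengths, top_percent=5):
    """Summarise segment lengths (bp); the top group is the longest top_percent of segments."""
    ordered = sorted(lengths, reverse=True)
    # Assumption: round the group size up and keep at least one segment.
    top_n = max(1, math.ceil(len(ordered) * top_percent / 100))
    top = ordered[:top_n]
    return {
        "total": sum(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "top_mean": statistics.mean(top),
        "top_median": statistics.median(top),
    }

# Hypothetical input: 40 segments with lengths 1..40 bp.
seg_stats = segment_length_stats(range(1, 41))
print(seg_stats)
```

With 40 segments, the top 5% group contains the two longest segments (40 bp and 39 bp).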
Link Statistics
| Metric | Description |
|---|---|
| Max Segment Degree | The highest number of links connected to any single segment in the graph. |
| Average Segment Degree | The average number of links per segment (2 * Total Links / Total Segments). |
| Self-Links (S1 -> S1) | The number of links that connect a segment to itself. |
| Inverted Links (S1+ -> S2-) | The number of links connecting segments with opposite orientations. |
| Both Negative Links (S1- -> S2-) | The number of links connecting two segments, both in their reverse-complement orientation. |
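The link metrics can be sketched from the first four data fields of each L-line (from segment, from orientation, to segment, to orientation). The code below is a minimal illustration under stated assumptions, not the tool's implementation: `link_stats` is a made-up name, a self-link is assumed to add 2 to its segment's degree, and any link whose two orientations differ (`+/-` or `-/+`) is assumed to count as inverted.

```python
from collections import Counter

def link_stats(links, n_segments):
    """links: iterable of (from_id, from_orient, to_id, to_orient) tuples."""
    degree = Counter()
    n_links = self_links = inverted = both_negative = 0
    for src, src_orient, dst, dst_orient in links:
        n_links += 1
        degree[src] += 1
        degree[dst] += 1  # assumption: a self-link touches its segment twice
        if src == dst:
            self_links += 1
        if src_orient != dst_orient:
            inverted += 1  # assumption: both +/- and -/+ count as inverted
        elif src_orient == "-":
            both_negative += 1
    return {
        "max_degree": max(degree.values(), default=0),
        # Each link has two ends, hence 2 * links / segments.
        "avg_degree": 2 * n_links / n_segments if n_segments else 0.0,
        "self_links": self_links,
        "inverted": inverted,
        "both_negative": both_negative,
    }

# Hypothetical graph with 5 segments (segment "5" has no links).
toy_links = [
    ("1", "+", "2", "+"),
    ("2", "+", "3", "-"),
    ("3", "-", "4", "-"),
    ("4", "+", "4", "+"),
]
link_summary = link_stats(toy_links, n_segments=5)
print(link_summary)
```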
Path (Walk) Statistics
| Metric | Description |
|---|---|
| Total Length of All Paths (bp) | The sum of lengths of all paths (walks). Represents the total sequence content of the input genomes. |
| Graph Compression Ratio | The ratio of total path length to total unique segment length. A value > 1 indicates sequence redundancy was collapsed into shared segments. |
| Max Segments in a Single Walk | The highest number of segments found in any single walk (W-line). |
| Sum of First Segment Lengths in Walks | The sum of the lengths of the first segment of each walk. |
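To make the compression ratio concrete, the following sketch derives the path metrics from a mapping of segment lengths and a list of GFA 1.1 walk strings (for example `>1>2<3`). The name `path_stats` is hypothetical, and the sketch assumes that a path's length is simply the sum of the lengths of the segments it visits, ignoring link overlaps; GraTools may compute it differently.

```python
import re

# One walk step is an orientation marker (> or <) followed by a segment ID.
WALK_STEP = re.compile(r"([<>])([^<>]+)")

def path_stats(segment_lengths, walks):
    """segment_lengths: {segment_id: length in bp}; walks: iterable of walk strings."""
    total_path_length = max_steps = first_segment_sum = 0
    for walk in walks:
        steps = [segment for _orient, segment in WALK_STEP.findall(walk)]
        if not steps:
            continue
        total_path_length += sum(segment_lengths[s] for s in steps)
        max_steps = max(max_steps, len(steps))
        first_segment_sum += segment_lengths[steps[0]]
    graph_size = sum(segment_lengths.values())
    return {
        "total_path_length": total_path_length,
        "compression_ratio": total_path_length / graph_size if graph_size else 0.0,
        "max_segments_in_walk": max_steps,
        "first_segment_sum": first_segment_sum,
    }

# Hypothetical graph: 40 bp of unique sequence traversed by three 35 bp walks.
toy_lengths = {"1": 10, "2": 5, "3": 20, "4": 5}
toy_walks = [">1>2>3", ">1>4>3", ">1>2>3"]
path_summary = path_stats(toy_lengths, toy_walks)
print(path_summary)
```

Here 105 bp of path sequence is stored in 40 bp of segments, so the ratio is 2.625: most of the sequence is shared between walks.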
Segment Sharing & Depth
| Metric | Description |
|---|---|
| Avg Unique Samples per Segment (Similarity Mean) | The average number of distinct samples that pass through a segment. |
| Median Unique Samples per Segment (Similarity Median) | The median number of distinct samples that pass through a segment. |
| StdDev Unique Samples per Segment (Similarity Std) | The standard deviation of the number of unique samples per segment. |
| Avg Occurrences per Segment (Depth Mean) | The average number of times a segment appears across all paths (walks). |
| Median Occurrences per Segment (Depth Median) | The median number of times a segment appears across all paths. |
| StdDev Occurrences per Segment (Depth Std) | The standard deviation of segment occurrences. |
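The difference between similarity (distinct samples per segment) and depth (total occurrences per segment) is easiest to see on a small example. The sketch below is an assumption-laden illustration rather than the tool's method: `sharing_and_depth` is a made-up name, segments never visited by a walk are assumed to count as zero, and the population standard deviation (`statistics.pstdev`) is an arbitrary choice, since the source does not say which variant is used.

```python
import re
import statistics
from collections import Counter, defaultdict

WALK_STEP = re.compile(r"[<>]([^<>]+)")

def sharing_and_depth(segment_ids, walks):
    """walks: iterable of (sample_name, walk_string) pairs."""
    samples_per_segment = defaultdict(set)
    occurrences = Counter()
    for sample, walk in walks:
        for segment in WALK_STEP.findall(walk):
            samples_per_segment[segment].add(sample)
            occurrences[segment] += 1
    # Assumption: segments absent from every walk contribute zeros.
    similarity = [len(samples_per_segment[s]) for s in segment_ids]
    depth = [occurrences[s] for s in segment_ids]

    def summarise(values):
        return {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "std": statistics.pstdev(values),  # population std (assumption)
        }

    return {"similarity": summarise(similarity), "depth": summarise(depth)}

# Hypothetical input: sample A has two walks, one of which visits segment 2 twice.
toy_walks = [("A", ">1>2>3"), ("B", ">1>4>3"), ("A", ">1>2>2>3")]
sharing = sharing_and_depth(["1", "2", "3", "4"], toy_walks)
print(sharing)
```

Segment 2 is visited three times but only by sample A, so it raises the depth values without raising the similarity values.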
Graph Structure
| Metric | Description |
|---|---|
| Graph Density | A measure of how close the graph is to being complete. Lower values indicate a sparse graph. |
| Segments/Links Ratio | The ratio of segments to links. |
| Dead-End Segments (degree 1) | The number of segments that have only one link connected to them. |
| Isolated Segments (degree 0) | The number of segments that have no links. |
| Number of Connected Components (CCs) | The number of distinct, separate subgraphs within the GFA. |
| Largest CC Size (bp) | The total sequence length (sum of segment lengths) of the largest connected component. |
| Number of Disconnected CCs (excluding largest) | The count of all connected components other than the largest one. |
| Total Length of Disconnected CCs (bp) | The sum of the sequence lengths of all components other than the largest one. |
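The component-based metrics can be illustrated with a simple graph traversal that treats links as undirected edges. This is a sketch under assumptions, not the GraTools implementation: `structure_stats` is a hypothetical name, the "largest" component is assumed to be the one with the most base pairs (rather than the most segments), and Graph Density is left out because the source does not give its formula.

```python
def structure_stats(segment_lengths, links):
    """segment_lengths: {segment_id: bp}; links: iterable of (from_id, to_id) pairs."""
    links = list(links)
    adjacency = {segment: set() for segment in segment_lengths}
    degree = dict.fromkeys(segment_lengths, 0)
    for src, dst in links:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
        degree[src] += 1
        degree[dst] += 1

    # Depth-first traversal; record the total bp of each connected component.
    seen = set()
    component_sizes = []
    for start in segment_lengths:
        if start in seen:
            continue
        seen.add(start)
        stack, size_bp = [start], 0
        while stack:
            node = stack.pop()
            size_bp += segment_lengths[node]
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        component_sizes.append(size_bp)

    largest = max(component_sizes, default=0)  # assumption: largest by bp
    return {
        "segments_links_ratio": len(segment_lengths) / len(links) if links else None,
        "dead_ends": sum(1 for d in degree.values() if d == 1),
        "isolated": sum(1 for d in degree.values() if d == 0),
        "components": len(component_sizes),
        "largest_cc_bp": largest,
        "disconnected_ccs": max(len(component_sizes) - 1, 0),
        "disconnected_bp": sum(component_sizes) - largest,
    }

# Hypothetical graph: components {1,2,3} (35 bp), {4,5} (12 bp) and {6} (3 bp).
toy_lengths = {"1": 10, "2": 5, "3": 20, "4": 5, "5": 7, "6": 3}
structure = structure_stats(toy_lengths, [("1", "2"), ("2", "3"), ("4", "5")])
print(structure)
```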