gratools stats

The stats command provides a comprehensive overview of a GFA file’s properties. It parses the graph (or uses a pre-existing GraTools index) to calculate a wide range of metrics related to its size, connectivity, and complexity. This is useful for quality control and for getting a high-level understanding of the pangenome structure.

Options

The examples on this page use the following options:

-g             Path to the input GFA file (the examples use a gzip-compressed file).
--by-category  Display the statistics in a separate table for each category.
--save         Save the statistics in CSV format within the GraTools index directory.

Usage Examples

1. Display Consolidated Statistics Table (Default)

This command calculates all statistics and displays them in a single, consolidated table.

$ gratools stats -g Og_cactus.gfa.gz

╭─────────────────────────┬─────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────╮
│ Category                 Metric                                                       Value                                                                │
├─────────────────────────┼─────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Graph Overview           GFA File Name                                                Og_cactus                                                            │
│ Graph Overview           GFA Version                                                  1.1                                                                  │
│ Graph Overview           Total Segments (S lines)                                     2,354,995                                                            │
│ Graph Overview           Total Links (L lines)                                        6,670,282                                                            │
│ Graph Overview           Total Walks (W lines)                                        23                                                                   │
│ Graph Overview           Unique Samples in Walks                                      5                                                                    │
│ Segment Statistics       Total Segment Length (bp)                                    57,119,496                                                           │
│ ...                      ...                                                          ...                                                                  │
╰─────────────────────────┴─────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────╯

2. Display Statistics by Category

For better readability, use the --by-category flag to display the statistics in a separate table for each category.

$ gratools stats -g Og_cactus.gfa.gz --by-category

--- GFA Statistics for: Og_cactus ---
       Graph Overview Statistics
╭──────────────────────────┬───────────╮
│ Metric                    Value     │
├──────────────────────────┼───────────┤
│ GFA File Name             Og_cactus │
│ ...                       ...       │
╰──────────────────────────┴───────────╯
                                             Segment Statistics Statistics
╭───────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────╮
│ Metric                                         Value                                                                │
├───────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Total Segment Length (bp)                      57,119,496                                                           │
│ ...                                            ...                                                                  │
╰───────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────╯
...

3. Save Statistics to a File

To save the output in CSV format for later analysis, use the --save flag. The file is written to the GraTools index directory.

$ gratools stats -g Og_cactus.gfa.gz --save

Stats Explained

The output is divided into several categories, each providing insight into a different aspect of the graph.

Graph Overview

Metric                    Description
GFA File Name             The base name of the input GFA file.
GFA Version               The GFA version declared in the file’s header (e.g., 1.1).
Total Segments (S lines)  The total count of segment lines (S-lines) in the GFA file.
Total Links (L lines)     The total count of link lines (L-lines) in the GFA file.
Total Walks (W lines)     The total count of walk lines (W-lines), representing paths for samples.
Unique Samples in Walks   The number of distinct sample names found in the walk lines.
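
As a rough illustration of how the overview counts relate to the GFA records, the following Python sketch tallies S, L, and W lines and collects sample names from the second field of each W line. It is not the GraTools implementation: the function names `gfa_overview` and `open_gfa` and the tiny example graph are made up for this illustration.

```python
import gzip
import io
from collections import Counter


def gfa_overview(handle):
    """Count S/L/W records and distinct walk samples from an open text handle."""
    counts = Counter()
    samples = set()
    version = None
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0]
        counts[record_type] += 1
        if record_type == "H":
            # The header carries the version as a VN:Z:<version> tag.
            for tag in fields[1:]:
                if tag.startswith("VN:Z:"):
                    version = tag[len("VN:Z:"):]
        elif record_type == "W" and len(fields) > 1:
            # In GFA 1.1 the second field of a W line is the sample identifier.
            samples.add(fields[1])
    return {
        "GFA Version": version,
        "Total Segments (S lines)": counts["S"],
        "Total Links (L lines)": counts["L"],
        "Total Walks (W lines)": counts["W"],
        "Unique Samples in Walks": len(samples),
    }


def open_gfa(path):
    """Open a plain or gzip-compressed GFA file as text (hypothetical helper)."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


# A tiny made-up graph used only for illustration.
EXAMPLE_GFA = (
    "H\tVN:Z:1.1\n"
    "S\ts1\tACGT\n"
    "S\ts2\tGG\n"
    "S\ts3\tTTTTTT\n"
    "L\ts1\t+\ts2\t+\t0M\n"
    "L\ts2\t+\ts3\t+\t0M\n"
    "L\ts1\t+\ts3\t+\t0M\n"
    "W\tsampleA\t1\tchr1\t0\t12\t>s1>s2>s3\n"
    "W\tsampleB\t1\tchr1\t0\t10\t>s1>s3\n"
)
overview = gfa_overview(io.StringIO(EXAMPLE_GFA))
print(overview)
```

For a real file, the same function could be called as `gfa_overview(open_gfa("Og_cactus.gfa.gz"))`; on large graphs this streams the file once and keeps only the counters and the set of sample names in memory.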

Segment Statistics

Metric                                         Description
Total Segment Length (bp)                      The sum of the lengths of all unique segments in the graph. Also called Graph Size.
Average Segment Length (bp)                    The mean length of segments (Total Segment Length / Total Segments).
Median Segment Length (bp)                     The median length across all segments.
Avg Length of Top 5% Longest Segments (bp)     The average length of the 5% of segments that are the longest.
Median Length of Top 5% Longest Segments (bp)  The median length of the 5% of segments that are the longest.
Segment Length Distribution                    The number of segments falling into predefined length bins.
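
The segment metrics above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GraTools code: the function names are hypothetical, the top 5% is assumed to be rounded up so that at least one segment is always kept, and the bin edges are invented because the predefined bins are not listed here.

```python
import math
import statistics
from bisect import bisect_right


def segment_length_stats(lengths, top_percent=5):
    """Summarise segment lengths, including the longest `top_percent` percent."""
    ordered = sorted(lengths, reverse=True)
    # Assumption: round up, and always keep at least one segment.
    top_n = max(1, math.ceil(len(ordered) * top_percent / 100))
    top = ordered[:top_n]
    return {
        "total": sum(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "top_mean": statistics.mean(top),
        "top_median": statistics.median(top),
    }


def length_distribution(lengths, edges=(5, 10)):
    """Count lengths per bin: [0, 5), [5, 10), [10, inf) for the default edges."""
    counts = [0] * (len(edges) + 1)
    for length in lengths:
        counts[bisect_right(edges, length)] += 1
    return counts


# Made-up segment lengths in bp.
lengths = [4, 2, 6, 8, 10]
seg_stats = segment_length_stats(lengths)
seg_bins = length_distribution(lengths)
print(seg_stats)
print(seg_bins)
```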

Path (Walk) Statistics

Metric                                 Description
Total Length of All Paths (bp)         The sum of lengths of all paths (walks). Represents the total sequence content of the input genomes.
Graph Compression Ratio                The ratio of total path length to total unique segment length. A value > 1 indicates sequence redundancy was collapsed into shared segments.
Max Segments in a Single Walk          The highest number of segments found in any single walk (W-line).
Sum of First Segment Lengths in Walks  The sum of the lengths of the first segment of each walk.
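
To make the path metrics concrete, here is a minimal Python sketch that derives them from segment lengths and GFA 1.1 walk strings such as `>s1>s2<s3`. It assumes that a path's length is the sum of the lengths of the segments it visits (no overlaps between consecutive segments); the function name `path_statistics` and the example data are hypothetical and do not reflect how GraTools computes these values internally.

```python
import re

# A walk is a run of oriented steps, each ">" or "<" followed by a segment name.
WALK_STEP = re.compile(r"[<>]([^<>]+)")


def path_statistics(seg_lengths, walks):
    """Compute path-level metrics from {segment: length} and walk strings."""
    total_segment_length = sum(seg_lengths.values())
    total_path_length = 0
    max_steps = 0
    first_segment_sum = 0
    for walk in walks:
        steps = WALK_STEP.findall(walk)
        if not steps:
            continue
        # Assumption: path length is the sum of visited segment lengths.
        total_path_length += sum(seg_lengths[s] for s in steps)
        max_steps = max(max_steps, len(steps))
        first_segment_sum += seg_lengths[steps[0]]
    ratio = (
        total_path_length / total_segment_length if total_segment_length else 0.0
    )
    return {
        "total_path_length": total_path_length,
        "compression_ratio": ratio,
        "max_segments_in_walk": max_steps,
        "first_segment_sum": first_segment_sum,
    }


# Made-up graph: three segments (12 bp in total) and two walks.
seg_lengths = {"s1": 4, "s2": 2, "s3": 6}
walks = [">s1>s2>s3", ">s1>s3"]
path_stats = path_statistics(seg_lengths, walks)
print(path_stats)
```

In this toy graph the two walks spell 12 bp and 10 bp, so 22 bp of path sequence are stored in 12 bp of segments, giving a compression ratio of about 1.83.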

Segment Sharing & Depth

Metric                                                 Description
Avg Unique Samples per Segment (Similarity Mean)       The average number of distinct samples that pass through a segment.
Median Unique Samples per Segment (Similarity Median)  The median number of distinct samples that pass through a segment.
StdDev Unique Samples per Segment (Similarity Std)     The standard deviation of the number of unique samples per segment.
Avg Occurrences per Segment (Depth Mean)               The average number of times a segment appears across all paths (walks).
Median Occurrences per Segment (Depth Median)          The median number of times a segment appears across all paths.
StdDev Occurrences per Segment (Depth Std)             The standard deviation of segment occurrences.
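
The difference between similarity (distinct samples per segment) and depth (occurrences per segment) is easiest to see on a small example. The Python sketch below is only an illustration: the function name `sharing_and_depth` is hypothetical, segments that no walk visits are assumed to count as zero, and the population standard deviation is used, which may differ from what GraTools reports.

```python
import re
import statistics
from collections import Counter, defaultdict

# A walk is a run of oriented steps, each ">" or "<" followed by a segment name.
WALK_STEP = re.compile(r"[<>]([^<>]+)")


def sharing_and_depth(segment_ids, walks):
    """Summarise per-segment sample counts and occurrence counts.

    `walks` is an iterable of (sample_name, walk_string) pairs.
    """
    depth = Counter()
    samples_per_segment = defaultdict(set)
    for sample, walk in walks:
        for seg in WALK_STEP.findall(walk):
            depth[seg] += 1                       # every visit counts
            samples_per_segment[seg].add(sample)  # each sample counts once

    def summarise(values):
        return {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # Assumption: population standard deviation.
            "std": statistics.pstdev(values),
        }

    # Assumption: segments not visited by any walk contribute a zero.
    similarity = [len(samples_per_segment[s]) for s in segment_ids]
    depths = [depth[s] for s in segment_ids]
    return {"similarity": summarise(similarity), "depth": summarise(depths)}


# Made-up data: sampleA has two walks (e.g. two haplotypes), sampleB has one.
segment_ids = ["s1", "s2", "s3"]
walks = [
    ("sampleA", ">s1>s2>s3"),
    ("sampleA", ">s1>s2"),
    ("sampleB", ">s1>s3"),
]
sharing = sharing_and_depth(segment_ids, walks)
print(sharing)
```

Here segment s1 is visited three times but by only two samples, so its depth is 3 while its similarity is 2.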

Graph Structure

Metric                                          Description
Graph Density                                   A measure of how close the graph is to being complete, i.e. how many of the possible links between segments are present. Lower values indicate a sparser graph.
Segments/Links Ratio                            The ratio of the number of segments to the number of links.
Dead-End Segments (degree 1)                    The number of segments that have exactly one link connected to them.
Isolated Segments (degree 0)                    The number of segments that have no links.
Number of Connected Components (CCs)            The number of distinct, separate subgraphs within the GFA.
Largest CC Size (bp)                            The total sequence length (sum of segment lengths) of the largest connected component.
Number of Disconnected CCs (excluding largest)  The count of all connected components other than the largest one.
Total Length of Disconnected CCs (bp)           The sum of the sequence lengths of all components other than the largest one.
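
The structural metrics can be approximated with a small Python sketch that treats the graph as undirected and ignores segment orientation. Several choices here are assumptions rather than documented GraTools behaviour: density is taken as 2E / (N(N-1)), degree is the number of link ends touching a segment, and the "largest" component is the one with the most base pairs. The function name `graph_structure` and the example data are hypothetical.

```python
from collections import Counter, defaultdict


def graph_structure(seg_lengths, links):
    """Compute structure metrics from {segment: length} and (from, to) links.

    Every link endpoint is expected to be a key of `seg_lengths`.
    """
    degree = Counter()
    neighbours = defaultdict(set)
    for a, b in links:
        degree[a] += 1
        degree[b] += 1
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Depth-first traversal to collect the bp size of each connected component.
    seen = set()
    component_sizes = []
    for start in seg_lengths:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size_bp = 0
        while stack:
            node = stack.pop()
            size_bp += seg_lengths[node]
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        component_sizes.append(size_bp)
    component_sizes.sort(reverse=True)

    n, e = len(seg_lengths), len(links)
    return {
        # Assumption: undirected simple-graph density.
        "density": 2 * e / (n * (n - 1)) if n > 1 else 0.0,
        "segments_links_ratio": n / e if e else float("inf"),
        "dead_ends": sum(1 for s in seg_lengths if degree[s] == 1),
        "isolated": sum(1 for s in seg_lengths if degree[s] == 0),
        "components": len(component_sizes),
        # Assumption: "largest" means largest in bp, not in segment count.
        "largest_cc_bp": component_sizes[0] if component_sizes else 0,
        "other_cc_count": max(0, len(component_sizes) - 1),
        "other_cc_bp": sum(component_sizes[1:]),
    }


# Made-up graph: a chain s1-s2-s3 plus an isolated segment s4.
seg_lengths = {"s1": 4, "s2": 2, "s3": 6, "s4": 5}
links = [("s1", "s2"), ("s2", "s3")]
structure = graph_structure(seg_lengths, links)
print(structure)
```

In this example s1 and s3 are dead ends, s4 is isolated, and the graph splits into two components of 12 bp and 5 bp.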