A program that computes rarefaction curves and Heaps' Law fit for a given pangenome.
🔙 To the main page of anvi’o programs and artifacts.
The program anvi-compute-rarefaction-curves goes through all genomes in a given pan-db and calculates rarefaction curves for all gene clusters and core gene clusters. It also computes the Heaps’ Law fit to model the relationship between genome sampling and the number of new gene clusters discovered for you to have a more comprehensive reporting of your pangenome.
Rarefaction curves are helpful in the analysis of pangenome as they help visualize the discovery rate of new gene clusters as a function of increasing number of genomes. While a steep curve suggests that many new gene clusters are still being discovered, indicating incomplete coverage of the potential gene cluster space, a curve that reaches a plateau suggests sufficient sampling of gene cluster diversity.
However, rarefaction curves have inherent limitations. Because genome sampling is often biased and unlikely to fully capture the true genetic diversity of any taxon, rarefaction analysis provides only dataset-specific insights. Despite these limitations, rarefaction curves remain a popular tool for characterizing whether a pangenome is relatively ‘open’ (with continuous gene discovery) or ‘closed’ (where new genome additions contribute few or no new gene clusters). As long as you take such numerical summaries with a huge grain of salt, it is all fine.
Fitting Heaps’ Law to the rarefaction curve provides a quantitative measure of pangenome openness. The alpha value derived from Heaps’ Law (sometimes referred to as gamma in the literature) reflects how the number of new gene clusters scales with increasing genome sampling. There is no science to define an absolute threshold for an open or a closed pangenome. However, pangenomes with alpha values below 0.3 tend to be relatively closed, and those above 0.3 tend to be relatively open. Higher alpha values will indicate increasingly open pangenomes and lower values will identify progressively closed ones.
The simplest for of the command will look like this,
anvi-compute-rarefaction-curves -p pan-db
Running this program on a pan-db will,
The program will use the ‘project name’ information stored in the pan-db as a ‘prefix’ to all resulting files, and the output will look like this:
Number of genomes found ......................: 5
Number of iterations to run ..................: 100
Output file prefix ...........................: TEST
Heaps' Law parameters estimated ..............: K=245.3049, alpha=0.2484
OUTPUT FILES
===============================================
Rarefaction curves ...........................: TEST-rarefaction-curves.svg
GC gain per genome for core (averages) .......: TEST-rarefaction-core-averages.txt
GC gain per genome for core (each iteration) .: TEST-rarefaction-core-iterations.txt
GC gain per genome for all (averages) ........: TEST-rarefaction-pangenome-averages.txt
GC gain per genome for all (each iteration) ..: TEST-rarefaction-pangenome-iterations.txt
You can change the prefix to something else,
anvi-compute-rarefaction-curves -p pan-db \ --output-file-prefix MY_PREFIX
or you can ask the program to not generate any output files and simply report the Heaps’ Law parameters:
anvi-compute-rarefaction-curves -p pan-db \ --skip-output-files
You can also determine the number of random sampling to be conducted through the --iteration
parameter. The default is 100. Going above this value will unlikely refine the results, but going below 10 will have a negative influence since the fit will be affected by small amount of sampling:
anvi-compute-rarefaction-curves -p pan-db \ --iterations 50
Edit this file to update this information.
Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__
tag in this file to see an example.