A program that computes rarefaction curves and Heaps' Law fit for a given pangenome.
The program anvi-compute-rarefaction-curves goes through all genomes in a given pan-db and calculates rarefaction curves for all gene clusters and core gene clusters. It also computes the Heaps’ Law fit to model the relationship between genome sampling and the number of new gene clusters discovered, giving you a more comprehensive report on your pangenome.
Rarefaction curves are helpful in the analysis of pangenomes, as they visualize the discovery rate of new gene clusters as a function of an increasing number of genomes. A steep curve suggests that many new gene clusters are still being discovered, indicating incomplete coverage of the potential gene cluster space, while a curve that reaches a plateau suggests sufficient sampling of gene cluster diversity.
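To make the idea concrete, here is a minimal Python sketch of how a rarefaction curve can be computed: genomes are added in random order, the cumulative number of distinct gene clusters is recorded at each step, and the counts are averaged over many random orderings. This is an illustration only and not the anvi'o implementation; the function name `rarefaction_curve`, the `genome_to_clusters` input, and the toy data are all hypothetical.

```python
import random


def rarefaction_curve(genome_to_clusters, iterations=100, seed=0):
    """Return the mean number of distinct gene clusters observed after
    adding 1..N genomes, averaged over random genome orderings.

    genome_to_clusters: dict mapping a genome name to the set of gene
    cluster names found in that genome (hypothetical input format).
    """
    rng = random.Random(seed)
    genomes = sorted(genome_to_clusters)
    totals = [0] * len(genomes)

    for _ in range(iterations):
        order = genomes[:]
        rng.shuffle(order)  # one random sampling order of the genomes
        seen = set()
        for i, genome in enumerate(order):
            seen |= genome_to_clusters[genome]
            totals[i] += len(seen)  # cumulative gene clusters so far

    return [total / iterations for total in totals]


# Hypothetical toy pangenome with three genomes and five gene clusters
toy = {
    "G1": {"gc1", "gc2", "gc3"},
    "G2": {"gc1", "gc2", "gc4"},
    "G3": {"gc1", "gc2", "gc3", "gc5"},
}

curve = rarefaction_curve(toy, iterations=50)
print(curve)  # the last value is always the total number of gene clusters
```

A curve whose last few values barely increase corresponds to the plateau described above, while one that keeps climbing corresponds to the steep case.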
However, rarefaction curves have inherent limitations. Because genome sampling is often biased and unlikely to fully capture the true genetic diversity of any taxon, rarefaction analysis provides only dataset-specific insights. Despite these limitations, rarefaction curves remain a popular tool for characterizing whether a pangenome is relatively ‘open’ (with continuous gene discovery) or ‘closed’ (where new genome additions contribute few or no new gene clusters). As long as you take such numerical summaries with a huge grain of salt, it is all fine.
Fitting Heaps’ Law to the rarefaction curve provides a quantitative measure of pangenome openness. The alpha value derived from Heaps’ Law (sometimes referred to as gamma in the literature) reflects how the number of new gene clusters scales with increasing genome sampling. There is no scientific basis for an absolute threshold that separates an open pangenome from a closed one. However, pangenomes with alpha values below 0.3 tend to be relatively closed, and those above 0.3 tend to be relatively open. Higher alpha values indicate increasingly open pangenomes, and lower values indicate progressively closed ones.
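As a rough illustration of what such a fit involves, the sketch below assumes the commonly used power-law form G(N) = K * N^alpha, where G(N) is the number of gene clusters observed after sampling N genomes, and estimates K and alpha with a least-squares line on log-transformed values. Both the functional form and the log-log fitting approach are assumptions made for this example; the program itself may fit the curve differently, and the function name `fit_heaps_law` is hypothetical.

```python
import math


def fit_heaps_law(curve):
    """Fit G(N) = K * N**alpha by ordinary least squares on log-log values.

    curve[i] is the (mean) number of gene clusters after i + 1 genomes.
    Returns the tuple (K, alpha).
    """
    xs = [math.log(n) for n in range(1, len(curve) + 1)]
    ys = [math.log(g) for g in curve]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Slope of the log-log regression line is the alpha estimate
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx

    # Intercept of the line is log(K)
    k = math.exp(mean_y - alpha * mean_x)
    return k, alpha


# Synthetic curve generated with K = 1000 and alpha = 0.25 (made-up values)
synthetic = [1000 * n ** 0.25 for n in range(1, 21)]

k, alpha = fit_heaps_law(synthetic)
print(round(k, 2), round(alpha, 3))  # recovers values close to 1000 and 0.25
```

Under the rule of thumb above, the alpha of 0.25 recovered here would place this synthetic pangenome on the relatively closed side.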
The simplest form of the command looks like this:
anvi-compute-rarefaction-curves -p pan-db
This will only report the Heaps’ Law alpha value, which you can use in downstream reporting.
You can also set the number of random samplings to perform through the --iterations parameter. The default is 100. Going above this value is unlikely to refine the results, but going below 10 will hurt the fit, since it will then rest on too few samplings:
anvi-compute-rarefaction-curves -p pan-db \
                                --iterations 50
When an output file is provided, the program will store the rarefaction curve visualizations in a file:
anvi-compute-rarefaction-curves -p pan-db \
                                --iterations 50 \
                                --output-file rarefactions.svg
Please note that the file extension (e.g., .pdf, .svg, .png, etc.) will determine the resulting file format.