Generate a table that comprehensively summarizes the variability of nucleotide, codon, or amino acid positions. We call these single nucleotide variants (SNVs), single codon variants (SCVs), and single amino acid variants (SAAVs), respectively.
🔙 To the main page of anvi’o programs and artifacts.
contigs-db
profile-db
structure-db
bin
variability-profile
splits-txt
genes-of-interest-txt ![]()
This program takes the variability data stored within a profile-db and compiles it from across samples into a single matrix that comprehensively describes your SNVs, SCVs or SAAVs (a variability-profile-txt).
This program is described in detail in this blog post, which also covers the biological motivation, output column definitions, and example use cases.
Before this program even runs, anvi’o applies a dynamic, coverage-dependent filter during the profiling step (i.e., when you run anvi-profile on your BAM files) to reduce the reporting of variation due to sequencing errors and other factors that are difficult to distinguish frmo noise. Specifically, the base frequencies observed at a position in the read recruitment data are only stored in the profile-db if departure_from_reference value for that position, which quantifies the fraction of reads that differ from the reference nucleotide at a given position, meets or exceeds a coverage-dependent minimum threshold defined by:
where $x$ is the coverage depth and the model parameters are $b = 2$, $m = 1.45$, $c = 0.05$. This gives a threshold of approximately 0.17 at 20X coverage, 0.07 at 50X, and asymptotically approaches 0.05 at very high coverage. Positions that do not meet this threshold are silently excluded from the database and will never appear in the output of this program.
The user can bypass this filter entirely and store all observed variation regardless of frequency by including --report-variability-full when running anvi-profile. But the use of this parameter will dramatically increase database size, and should only be used with extreme care and attention (this is a very politically correct way to say “please do not use this parameter unless you are certain that this is what you need”).
Here is a basic run with no bells or whistles:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C DEFAULT \ -b EVERYTHING
Note that this program requires you to specify a subset of the data to focus on, so to work with everything in the databases, first run anvi-script-add-default-collection and use the resulting collection and bin as shown above.
You can also specify an output file path, which is useful when running multiple times with different --engine settings:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C DEFAULT \ -b EVERYTHING \ --output-file /path/to/your/variability.txt
Instead of a collection and bin, there are three alternatives:
Provide gene caller IDs directly or as a file:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ --gene-caller-ids 1,2,3
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ --genes-of-interest genes-of-interest-txt
Provide a splits-txt to focus on a specific set of splits:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ --splits-of-interest splits-txt
Provide any collection and bin:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin
You can limit the analysis to a subset of samples by providing a file with one sample name per line:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --samples-of-interest my_samples.txt
where my_samples.txt looks like:
DAY_17A DAY_18A DAY_22A …
For nucleotide-level analyses, you can restrict output to positions that fall within gene calls:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --exclude-intergenic
Which type of variants you analyze depends on the --engine parameter:
| Engine | Variant type | Requires |
|---|---|---|
NT (default) |
Single nucleotide variants (SNVs) | standard profiling |
CDN |
Single codon variants (SCVs) | --profile-SCVs at profiling time |
AA |
Single amino acid variants (SAAVs) | --profile-SCVs at profiling time |
To analyze SAAVs:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine AA
To analyze SCVs:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine CDN
You can add structural annotations by providing a structure-db:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C DEFAULT \ -b EVERYTHING \ -s structure-db
When a structure-db is provided, you can also limit your analysis to only genes that have structures in the database:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -s structure-db \ --only-if-structure
departure_from_reference is the fraction of reads at a position that differ from the reference nucleotide. departure_from_consensus is similar but measured against the most frequent allele in that sample. You can set minimum and maximum bounds on either:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --min-departure-from-reference 0.05 \ --max-departure-from-reference 0.90
To keep only positions that are variable in at least N samples (useful for reducing stochastic noise):
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --min-occurrence 3
To remove any position that has insufficient coverage in even one sample (requires --quince-mode to have coverage data for all samples at all positions):
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --quince-mode \ --min-coverage-in-each-sample 10
To randomly subsample variable positions and keep at most N per split:
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --num-positions-from-each-split 100
By default, if a position is variable in only some samples, the other samples will have no entry for that position in the output. With --quince-mode, the program goes back to the raw data and fills in allele frequencies for every sample at every reported position, even those where no variation was detected. This is essential for statistical approaches that require a complete matrix.
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --quince-mode
Note that --quince-mode substantially increases runtime and output file size.
When using --engine AA or --engine CDN, the default behavior reports only positions that had detectable variation during profiling. With --kiefl-mode, all positions in the analyzed genes are reported, with invariant positions given a reference allele frequency of 1. This is useful for analyses that need a complete picture of every codon or amino acid position, not just the variable ones. Incompatible with --quince-mode.
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine CDN \ --kiefl-mode
This flag was added in this pull request where you can read about the tests performed to validate its behavior.
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --include-contig-names \ --include-split-names
Contig names are excluded by default since they can nearly double file size.
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --compute-gene-coverage-stats
This appends per-gene coverage statistics to each row. It is computationally expensive and off by default.
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine CDN \ --include-site-pnps
This adds 12 columns of per-site synonymous and nonsynonymous substitution information, computed relative to the reference, the consensus, and the most common consensus across samples.
anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine AA \ --include-additional-data
This appends any data stored in the amino_acid_additional_data table as extra columns. Currently only supported for the AA engine.
Edit this file to update this information.
Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__ tag in this file to see an example.