Generate a table that comprehensively summarizes the variability of nucleotide, codon, or amino acid positions. We call these single nucleotide variants (SNVs), single codon variants (SCVs), and single amino acid variants (SAAVs), respectively.
contigs-db profile-db structure-db bin variability-profile splits-txt
This program takes the variability data stored within a profile-db and compiles it from across samples into a single matrix that comprehensively describes your SNVs, SCVs or SAAVs (a variability-profile-txt).
This program is described in this blog post, so take a look at that for more details.
Here is a basic run with no bells or whistles:
anvi-gen-variability-profile -p profile-db \
                             -c contigs-db \
                             -C DEFAULT \
                             -b EVERYTHING
Note that this program requires you to specify a subset of the databases that you want to focus on, so to focus on everything in the databases, run anvi-script-add-default-collection and use the resulting collection and bin, as shown above.
You can add structural annotations by providing a structure-db.
anvi-gen-variability-profile -p profile-db \
                             -c contigs-db \
                             -C DEFAULT \
                             -b EVERYTHING \
                             -s structure-db
Instead of focusing on everything (providing the collection DEFAULT and the bin EVERYTHING), there are three ways to focus on a subset of the input:
Provide a list of gene caller IDs, either as a parameter with the flag --gene-caller-ids (as shown below) or as a file.
anvi-gen-variability-profile -p profile-db \
                             -c contigs-db \
                             --gene-caller-ids 1,2,3
Provide a splits-txt to focus only on a specific set of splits.
anvi-gen-variability-profile -p profile-db \
                             -c contigs-db \
                             --splits-of-interest splits-txt
Provide some other collection and bin.
anvi-gen-variability-profile -p profile-db \
                             -c contigs-db \
                             -C collection \
                             -b bin
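If your gene caller IDs live in a file with one ID per line, one way to build the comma-separated value that --gene-caller-ids expects is sketched below. The file name gene_ids.txt is hypothetical, and the final command is only printed rather than executed, so the sketch runs without anvi'o installed.

```shell
# Hypothetical input file: one gene caller ID per line.
printf '%s\n' 1 2 3 > gene_ids.txt

# Join the lines with commas so they fit the --gene-caller-ids format.
gene_ids=$(paste -sd, gene_ids.txt)

# Dry run: print the command instead of executing it.
echo "anvi-gen-variability-profile -p profile-db -c contigs-db --gene-caller-ids $gene_ids"
```

Replace profile-db and contigs-db with your own database paths before running the printed command.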
When providing a structure-db, you can also limit your analysis to only genes that have structures in your database.
anvi-gen-variability-profile -p profile-db \
                             -c contigs-db \
                             -s structure-db \
                             --only-if-structure
You can also choose to look at only data from specific samples by providing a file with one sample name per line. For example
anvi-gen-variability-profile -p profile-db \
                             -c contigs-db \
                             -C collection \
                             -b bin \
                             --samples-of-interest my_samples.txt
where my_samples.txt looks like this:
DAY_17A
DAY_18A
DAY_22A
…
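A samples file in this format can be written and sanity-checked from the shell. This is a minimal sketch that uses only the three sample names shown in the example; your own list will differ.

```shell
# Write one sample name per line (names taken from the example above).
printf '%s\n' DAY_17A DAY_18A DAY_22A > my_samples.txt

# Sanity check: count the non-empty lines in the file.
n_samples=$(grep -c . my_samples.txt)
echo "samples listed: $n_samples"
```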
Whether you analyze SNVs, SCVs, or SAAVs depends entirely on the --engine parameter, which you can set to NT (nucleotides), CDN (codons), or AA (amino acids). The default value is nucleotides. Note that to analyze SCVs or SAAVs, you must have used the flag --profile-SCVs when you ran anvi-profile.
For example, to analyze SAAVs, run
anvi-gen-variability-profile -p profile-db \
                             -c contigs-db \
                             -C collection \
                             -b bin \
                             --engine AA
To analyze SCVs, run
anvi-gen-variability-profile -p profile-db \
                             -c contigs-db \
                             -C collection \
                             -b bin \
                             --engine CDN
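If you want both the SCV and SAAV tables, a small loop over the two engine values can generate one command per engine. The sketch below only prints the commands (a dry run). It assumes the program accepts an -o flag for naming the output file, which you should confirm with --help, and the variability_*.txt file names are made up for illustration.

```shell
# Build one command line per engine; nothing is executed here.
commands=$(
    for engine in CDN AA; do
        # -o (output file) is an assumption; the file names are hypothetical.
        echo "anvi-gen-variability-profile -p profile-db -c contigs-db -C collection -b bin --engine $engine -o variability_${engine}.txt"
    done
)
echo "$commands"
```

Writing each engine to its own file keeps the two tables from overwriting each other.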
You can filter the output in various ways, so that you can get straight to the variability positions that you’re most interested in. Here are some of the filters that you can set:
You can also set --quince-mode, which reports the variability data across all samples for each position reported (even if that position isn’t variable in some samples). For example, if nucleotide position 34 of contig 1 was a SNV in one sample, the output would contain data for nucleotide position 34 for all of your samples.
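To make the effect of --quince-mode concrete, here is a toy illustration in shell and awk. It is not how anvi'o implements the mode; it only mimics the shape of the result. The sample names S1 to S3 and the positions are invented: each position that is variable in at least one sample gets a row for every sample.

```shell
# Toy input: "sample position" pairs where variation was reported.
reported='S1 34
S2 78'

# All samples in the (made-up) project.
all_samples='S1 S2 S3'

# Quince-style expansion: every reported position is listed for every sample.
quince=$(printf '%s\n' "$reported" | awk -v samples="$all_samples" '
    { seen[$2] = 1 }                      # remember each reported position
    END {
        n = split(samples, s, " ")
        for (pos in seen)
            for (i = 1; i <= n; i++)
                print s[i], pos           # one row per sample per position
    }' | sort -k2,2n -k1,1)
echo "$quince"
```

Without the expansion only two rows would exist (S1 at position 34 and S2 at position 78); with it there are six, which is why quince-mode output can be much larger.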
The default behavior is to report codon/amino-acid frequencies only at positions where variation was reported during profiling (which by default uses some heuristics to minimize the impact of error-driven variation). Fair enough, but for some diabolical cases, you may want to report even invariant positions. When this flag is used, all positions are reported, regardless of whether they contained variation in any sample. The reference codon for all such entries is given a codon frequency of 1. All other entries (i.e., those with legitimate variation to be reported) remain unchanged. This flag can only be used with --engine AA or --engine CDN and is incompatible with
This flag was added in this pull request where you can read about all of the tests that were performed to ensure this mode is behaving properly.
You can also ask the program to report the contig names, split names, and gene-level coverage statistics, which appear as additional columns in the output.