anvi-gen-variability-profile

Generate a table that comprehensively summarizes the variability of nucleotide, codon, or amino acid positions. We call these single nucleotide variants (SNVs), single codon variants (SCVs), and single amino acid variants (SAAVs), respectively.

🔙 To the main page of anvi’o programs and artifacts.

Authors

Can consume

contigs-db profile-db structure-db bin variability-profile splits-txt genes-of-interest-txt

Can provide

variability-profile-txt

Usage

This program takes the variability data stored within a profile-db and compiles it from across samples into a single matrix that comprehensively describes your SNVs, SCVs or SAAVs (a variability-profile-txt).

This program is described in detail in this blog post, which also covers the biological motivation, output column definitions, and example use cases.

A note on default filtering of variability signal during profiling

Before this program even runs, anvi’o applies a dynamic, coverage-dependent filter during the profiling step (i.e., when you run anvi-profile on your BAM files) to reduce the reporting of variation due to sequencing errors and other factors that are difficult to distinguish frmo noise. Specifically, the base frequencies observed at a position in the read recruitment data are only stored in the profile-db if departure_from_reference value for that position, which quantifies the fraction of reads that differ from the reference nucleotide at a given position, meets or exceeds a coverage-dependent minimum threshold defined by:

\[y = \left(\frac{1}{b}\right)^{x^{\frac{1}{b}} - m} + c\]

where $x$ is the coverage depth and the model parameters are $b = 2$, $m = 1.45$, $c = 0.05$. This gives a threshold of approximately 0.17 at 20X coverage, 0.07 at 50X, and asymptotically approaches 0.05 at very high coverage. Positions that do not meet this threshold are silently excluded from the database and will never appear in the output of this program.

The user can bypass this filter entirely and store all observed variation regardless of frequency by including --report-variability-full when running anvi-profile. But the use of this parameter will dramatically increase database size, and should only be used with extreme care and attention (this is a very politically correct way to say “please do not use this parameter unless you are certain that this is what you need”).

Basic usage

Here is a basic run with no bells or whistles:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C DEFAULT \ -b EVERYTHING

Note that this program requires you to specify a subset of the data to focus on, so to work with everything in the databases, first run anvi-script-add-default-collection and use the resulting collection and bin as shown above.

You can also specify an output file path, which is useful when running multiple times with different --engine settings:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C DEFAULT \ -b EVERYTHING \ --output-file /path/to/your/variability.txt

Choosing what to analyze

Focusing on a subset of the input

Instead of a collection and bin, there are three alternatives:

  1. Provide gene caller IDs directly or as a file:

    anvi-gen-variability-profile -p profile-db \ -c contigs-db \ --gene-caller-ids 1,2,3

    anvi-gen-variability-profile -p profile-db \ -c contigs-db \ --genes-of-interest genes-of-interest-txt

  2. Provide a splits-txt to focus on a specific set of splits:

    anvi-gen-variability-profile -p profile-db \ -c contigs-db \ --splits-of-interest splits-txt

  3. Provide any collection and bin:

    anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin

Restricting to specific samples

You can limit the analysis to a subset of samples by providing a file with one sample name per line:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --samples-of-interest my_samples.txt

where my_samples.txt looks like:

DAY_17A DAY_18A DAY_22A …

Excluding intergenic positions

For nucleotide-level analyses, you can restrict output to positions that fall within gene calls:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --exclude-intergenic

Choosing the variability type (engine)

Which type of variants you analyze depends on the --engine parameter:

Engine Variant type Requires
NT (default) Single nucleotide variants (SNVs) standard profiling
CDN Single codon variants (SCVs) --profile-SCVs at profiling time
AA Single amino acid variants (SAAVs) --profile-SCVs at profiling time

To analyze SAAVs:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine AA

To analyze SCVs:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine CDN

Adding structural annotations

You can add structural annotations by providing a structure-db:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C DEFAULT \ -b EVERYTHING \ -s structure-db

When a structure-db is provided, you can also limit your analysis to only genes that have structures in the database:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -s structure-db \ --only-if-structure

Filtering the output

By departure from reference or consensus

departure_from_reference is the fraction of reads at a position that differ from the reference nucleotide. departure_from_consensus is similar but measured against the most frequent allele in that sample. You can set minimum and maximum bounds on either:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --min-departure-from-reference 0.05 \ --max-departure-from-reference 0.90

By minimum occurrence across samples

To keep only positions that are variable in at least N samples (useful for reducing stochastic noise):

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --min-occurrence 3

By coverage across all samples

To remove any position that has insufficient coverage in even one sample (requires --quince-mode to have coverage data for all samples at all positions):

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --quince-mode \ --min-coverage-in-each-sample 10

By number of positions per split

To randomly subsample variable positions and keep at most N per split:

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --num-positions-from-each-split 100

Special reporting modes

–quince-mode

By default, if a position is variable in only some samples, the other samples will have no entry for that position in the output. With --quince-mode, the program goes back to the raw data and fills in allele frequencies for every sample at every reported position, even those where no variation was detected. This is essential for statistical approaches that require a complete matrix.

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --quince-mode

Note that --quince-mode substantially increases runtime and output file size.

–kiefl-mode

When using --engine AA or --engine CDN, the default behavior reports only positions that had detectable variation during profiling. With --kiefl-mode, all positions in the analyzed genes are reported, with invariant positions given a reference allele frequency of 1. This is useful for analyses that need a complete picture of every codon or amino acid position, not just the variable ones. Incompatible with --quince-mode.

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine CDN \ --kiefl-mode

This flag was added in this pull request where you can read about the tests performed to validate its behavior.

Additional output columns

Contig and split names

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --include-contig-names \ --include-split-names

Contig names are excluded by default since they can nearly double file size.

Gene-level coverage statistics

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --compute-gene-coverage-stats

This appends per-gene coverage statistics to each row. It is computationally expensive and off by default.

Per-site pN/pS values (CDN engine only)

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine CDN \ --include-site-pnps

This adds 12 columns of per-site synonymous and nonsynonymous substitution information, computed relative to the reference, the consensus, and the most common consensus across samples.

Additional data from the database

anvi-gen-variability-profile -p profile-db \ -c contigs-db \ -C collection \ -b bin \ --engine AA \ --include-additional-data

This appends any data stored in the amino_acid_additional_data table as extra columns. Currently only supported for the AA engine.

Edit this file to update this information.

Additional Resources

Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__ tag in this file to see an example.