Get codon usage bias (CUB) statistics of genes and functions.
đ To the main page of anviâo programs and artifacts.
contigs-db profile-db collection bin internal-genomes external-genomes
This program does not seem to provide any artifacts. Such programs usually print out some information for you to see or alter some anviâo artifacts without producing any immediate outputs.
This program calculates codon usage bias (CUB) among genes or functions.
A range of options allows control over the CUB calculation to remove statistically spurious results.
Some CUB metrics depend on a reference codon composition while others are reference-independent. The reference often constitutes expected high-expression genes, such as ribosomal proteins. This program allows customization of the reference. It also introduces an âomnibiasâ mode, in which reference compositions are determined from every gene (or function), and CUB is calculated for every combination of query and reference gene. This produces a square distance-like matrix of CUB values that can be used to cluster genes or functions by their biases relative to one another.
This command produces a table of CUB values from coding sequences in the contigs database. The first column of the table contains gene caller IDs and each subsequent column contains values for a CUB metric, e.g., the Codon Adaptation Index (CAI) of Sharp and Li, 1987 and đż of Ran and Higgs, 2012. CUB metrics that rely upon a reference codon composition, such as CAI and đż, establish this composition from genes in the genome that are annotated as ribosomal proteins by KEGG KOfams/BRITE. Certain parameters have default values to increase the statistical significance of results (--query-min-analyzed-codons
, --reference-exclude-amino-acid-count
, --reference-min-analyzed-codons
).
anvi-get-codon-usage-bias -c contigs-db \ -o path/to/output.txt
This command produces a table of CUB values for functions rather than genes. By using --function-sources
without any arguments, the output will include every functions source available in a given contigs-db, e.g., KOfam
, KEGG_BRITE
, Pfam
(you can always see the complete list of functions in your contigs-db by running the program anvi-db-info on it). The first four columns of the table before CUB values contain, respectively, gene caller IDs, function sources, accessions, and names.
anvi-get-codon-usage-bias -c contigs-db \ -o path/to/output.txt \ --function-sources
This command establishes a CUB reference codon composition from genes with functional annotations defined a supplemental file. This file should be headerless, tab-delimited, and have three columns of function annotation sources (required), function accessions, and function names (either an accession or name is required).
anvi-get-codon-usage-bias -c contigs-db \ -o path/to/output.txt \ --select-reference-functions-txt path/to/select-reference-functions.txt
In reference-dependent CUB metrics, genes or functions are compared to a reference codon composition, such as from ribosomal proteins. âOmnibiasâ mode compares each gene/function âqueryâ not to a single reference, but to each other gene/function. This generates a square distance-like matrix of CUB values that can be used to cluster genes/functions by their biases relative to one another. The output path given in the command is modified to create an output file of the omnibias matrix for each CUB metric.
anvi-get-codon-usage-bias -c contigs-db \ -o path/to/output.txt \ --omnibias
This command produces a CUB table per genome, modifying the output path given in the command to create a separate output file per genome. Collection of genomes can be provided as internal-genomes, external-genomes, or both:
anvi-get-codon-usage-bias -i internal-genomes \ -e external-genomes \ -o path/to/output.txt
The following tables show the options to get the requested results.
Get | Options |
---|---|
With certain, rather than all, CUB metrics | --metrics cai |
With query sequences also used as references | --omnibias |
With a custom reference defined by functional annotations | --select-reference-functions-txt path/to/select_reference_functions.txt |
With a custom reference defined by certain genes | --reference-gene-caller-ids 0 2 500 |
Get | Options |
---|---|
All function annotation sources | --function-sources |
All KEGG BRITE categories | --function-sources KEGG_BRITE |
All KEGG KOfams and all Pfams | --function-sources KOfam Pfam |
Certain KEGG BRITE categories | --function-sources KEGG_BRITE --function-names Ribosome Ribosome>>>Ribosomal proteins |
Certain KEGG KOfam accessions | --function-sources KOfam --function-accessions K00001 K00002 |
Certain BRITE categories and KOfam accessions | --select-functions-txt path/to/select_functions.txt |
Get | Options |
---|---|
From contigs database | --contigs-db path/to/contigs.db |
From collection of internal genomes | --contigs-db path/to/contigs.db --profile-db path/to/profile.db --collection-name my_bins |
From internal genome | --contigs-db path/to/contigs.db --profile-db path/to/profile.db --collection-name my_bins --bin-id my_bin |
From internal genomes listed in a file | --internal-genomes path/to/genomes.txt |
From external genomes (contigs databases) listed in a file | --external-genomes path/to/genomes.txt |
With certain gene IDs | --gene-caller-ids 0 2 500 |
With certain gene IDs or genes annotated with certain KOfams | --gene-caller-ids 0 2 500 --function-sources KOfam --function-accessions K00001 |
Get | Options |
---|---|
Exclude genes shorter than 300 codons from queries | --gene-min-codons 300 |
Exclude genes shorter than 300 codons from contributing to function queries | --gene-min-codons 300 --function-sources |
Exclude function queries with <300 codons | --function-min-codons 300 |
Exclude stop codons and single-codon amino acids | --exclude-amino-acids STP Met Trp |
Exclude codons for amino acids with <5 codons in >90% of genes | --pansequence-min-amino-acids 5 0.9 |
Replace codons for amino acids with <5 codons in the gene or function with NaN | --sequence-min-amino-acids 5 |
Exclude queries with <300 codons involved in the CUB calculation | --query-min-analyzed-codons 300 |
Get | Options |
---|---|
Codons with a frequency <20 are excluded from the reference and CUB calculation | --reference-exclude-amino-acid-count 20 |
Exclude references with <300 codons involved in the CUB calculation | --reference-min-analyzed-codons 300 |
Functions and function annotation sources (e.g., âKOfamâ, âPfamâ) can be provided to calculate CUB for functions rather than genes.
There are multiple options to define which functions and sources should be used as CUB queries. --functions-sources
without arguments uses all available sources that had been used to annotate genes.
--function-accessions
and --function-names
select functions from a single provided source. The following example uses both options to select COG functions.
anvi-get-codon-usage-bias -c contigs-db \ -o path/to/output_table.txt \ --function-sources COG14_FUNCTION \ --function-accessions COG0004 COG0005 \ --function-names âAmmonia channel protein AmtBâ âPurine nucleoside phorphorylaseâ
To use different functions from different sources, a tab-delimited file can be provided to functions-txt
. This headerless file must have three columns, for source, accession, and name of functions, respectively, with an entry in each row for source.
By default, selected function accessions or names do not need to be present in input genomes; the program will query any selected function accessions or names that annotated genes. This behavior can be changed using the flag, --expect-functions
, so that the program will throw an error when any of the selected accessions or names are absent.
Genes are classified in KEGG BRITE functional hierarchies by anvi-run-kegg-kofams. For example, a bacterial SSU ribosomal protein is classified in a hierarchy of ribosomal genes, Ribosome>>>Ribosomal proteins>>>Bacteria>>>Small subunit
. CUB can be calculated for function queries at each level of the hierarchy, from the most general, those genes in the Ribosome
, to the most specific â in the example, those genes in Ribosome>>>Ribosomal proteins>>>Bacteria>>>Small subunit
. Therefore, the following command returns CUB values for each annotated hierarchy level â in the example, the output would include four rows for the genes in each level from Ribosome
to Small subunit
.
anvi-get-codon-usage-bias -c contigs-db \ -o path/to/output_table.txt \ --function-sources KEGG_BRITE
A custom reference codon composition for reference-dependent CUB calculations can be defined by functional annotations and/or genes IDs. By default, the reference codon composition is defined by the concatenation of all genes in the genome annotated as ribosomal proteins by KEGG KOfams/BRITE.
Use --select-reference-functions-txt
to provide a table of functional annotations defining a custom set of reference genes. The tab-delimited file should not have a header, but should have three columns of 1) function annotation sources, such as âKEGG_BRITEâ, âKOfamâ, or âPfamâ, 2) function accession, such as âK00001â for KOfam, and 3) function name, such as âRibosome»>Ribosomal proteinsâ for KEGG_BRITE. Every row must contain a source and either a function accession or name.
Use --expect-functions
to require that all functions that define the reference are found in the genome, else an error will be thrown. By default, not all provided functions need to be represented among the reference genes.
Use --reference-gene-caller-ids
for select genes to be in the custom reference gene set. This only works when processing a single genome. An error will be thrown if not all of the provided gene caller IDs are present in the genome.
It may be useful to restrict codons in the analysis to those encoding certain amino acids. Stop codons are excluded by default from CUB calculations. Codons encoding a single amino acid (Met and Trp) do not factor into CUB calculations. Example: exclude Ala, Arg, and stop codons with --exclude-amino-acids Ala Arg STP
.
Dynamic exclusion of amino acids can be useful in CUB calculations. For example, a query gene with 1 AAT and 1 AAC encoding Asn, or âsynonymous relative frequenciesâ of 0.5 AAT and 0.5 AAC, has very little data to support comparison to the synonymous relative frequencies of a large number of Asn codons in a set of reference genes. A query with 1 AAT and 0 AAC, or synonymous relative frequencies of 1.0 AAT and 0.0 AAC, would be even more statistically insignificant. Reference-dependent CUB metrics, such as đż, rely upon the ratio of synonymous relative codon frequencies in the query and reference, and so can be skewed for queries with small counts of various codons. --pansequence-min-amino-acids
removes rarer amino acids across the dataset, setting a minimum number of codons in a minimum number of genes to retain the amino acid. For example, amino acids with <5 codons in >90% of genes will be excluded from the analysis with the arguments, --pansequence-min-amino-acids 5 0.9
.
Codons for rarer amino acids within each gene or function query can be excluded from the CUB calculation with --sequence-min-amino-acids
. For example, amino acids with <5 codons in a query will be excluded from the analysis with --sequence-min-amino-acids 5
.
Genes with fewer than a minimum number of codons can be ignored in the CUB analysis. --gene-min-codons
sets the minimum number of codons required in a gene, and this filter can be applied before and/or after the removal of rarer codons. Applied before, --gene-min-codons
filters genes by length; applied after, it filters genes by codons remaining after removing rarer codons. --min-codon-filter
can take three possible arguments: length
, remaining
, or, by default when codons are removed, both
, which applies the --gene-min-codons
filter both before and after codon removal.
It may seem redundant for remaining
and both
to both be possibilities, but this is due to the possibility of dynamic amino acid exclusion using --pansequence-min-amino-acids
. Amino acids are removed based on their frequency in a proportion of genes, so removing shorter genes by length before removing amino acids can affect which amino acids are dynamically excluded.
Get | Options |
---|---|
Exclude genes shorter than 300 codons from queries | --gene-min-codons 300 |
Exclude genes shorter than 300 codons from contributing to function queries | --gene-min-codons 300 --function-sources |
Exclude function queries with <300 codons | --function-min-codons 300 |
Exclude stop codons and single-codon amino acids | --exclude-amino-acids STP Met Trp |
Exclude codons for amino acids with <5 codons in >90% of genes | --pansequence-min-amino-acids 5 0.9 |
Replace codons for amino acids with <5 codons in the gene or function with NaN | --sequence-min-amino-acids 5 |
Exclude queries with <300 codons involved in the CUB calculation | --query-min-analyzed-codons 300 |
Functions with fewer than a minimum number of codons can be ignored in the CUB analysis using --function-min-codons
. Function codon count filters occur after gene codon count filters: the set of genes contributing to function codon frequency can be restricted by applying --gene-min-codons
.
Filters removing codons from a gene or function query reduce the codon count factoring into the CUB calculation. --query-min-analyzed-codons
ignores queries in the CUB calculation that have fewer than a minimum number of codons remaining. For example, --query-min-analyzed-codons 300
ensures that after removing codons by --exclude-amino-acids
, --pansequence-min-amino-acids
, and/or --sequence-min-amino-acids
, queries must have 300 codons involved in the CUB calculation.
Edit this file to update this information.
Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__
tag in this file to see an example.