Get codon or amino acid frequency statistics from genomes, genes, and functions.
This program calculates codon or amino acid frequencies from genes or functions.
A range of options allows calculation of different frequency statistics. This program is “maximalist,” in that it has many options that do the equivalent of a couple extra commands in R or pandas – because we (not you) tend to be lazy and prone to mistakes.
This command produces a table of codon frequencies from coding sequences in the contigs database. The first column of the table contains gene caller IDs and subsequent columns contain frequency data. The decoded amino acid is included in each codon column name with the flag, --header-amino-acids.
anvi-get-codon-frequencies -c contigs-db \
                           -o path/to/output.txt \
                           --header-amino-acids
This command produces a table of function frequencies rather than gene frequencies. Using --function-sources without any arguments includes every function source available in a given contigs-db, e.g., Pfam (you can always see the complete list of function sources in your contigs-db by running the program anvi-db-info on it). The first four columns of the table, before the frequency data, contain, respectively, gene caller IDs, function sources, accessions, and names.
anvi-get-codon-frequencies -c contigs-db \
                           --function-sources \
                           --function-table-output path/to/function_output.txt
In contrast to the previous example, this command produces a table of gene frequencies, but has an entry for every gene/function pair, allowing statistical interrogation of the gene components of functions. The function table output is derived from this table by grouping rows by function source, retaining only one row per gene caller ID, and summing frequencies across rows of the groups.
anvi-get-codon-frequencies -c contigs-db \
                           --function-sources \
                           --gene-table-output path/to/gene_output.txt
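The relationship between the gene table and the function table can be sketched in plain Python. This is only an illustration under stated assumptions, not anvi'o's implementation: the rows and gene caller IDs are made up, the helper name `sum_by_function` is hypothetical, and only two codon columns are shown.

```python
from collections import defaultdict

# Hypothetical gene table rows:
# (gene_caller_id, source, accession, name, codon counts)
gene_rows = [
    (1, "COG14_FUNCTION", "COG0004", "Ammonia channel protein AmtB", {"GCC": 4, "GCT": 6}),
    (2, "COG14_FUNCTION", "COG0004", "Ammonia channel protein AmtB", {"GCC": 1, "GCT": 3}),
    # A repeated gene/function pair, which must only be counted once.
    (2, "COG14_FUNCTION", "COG0004", "Ammonia channel protein AmtB", {"GCC": 1, "GCT": 3}),
    (3, "COG14_FUNCTION", "COG0005", "Purine nucleoside phosphorylase", {"GCC": 2, "GCT": 0}),
]


def sum_by_function(rows):
    """Group rows by function, keep one row per gene in each group, and sum counts."""
    seen = set()
    totals = defaultdict(lambda: defaultdict(int))
    for gene_id, source, accession, name, counts in rows:
        key = (source, accession, name)
        if (key, gene_id) in seen:
            continue  # this gene already contributed to this function
        seen.add((key, gene_id))
        for codon, n in counts.items():
            totals[key][codon] += n
    return {key: dict(codon_counts) for key, codon_counts in totals.items()}


function_table = sum_by_function(gene_rows)
for function, codon_counts in function_table.items():
    print(function, codon_counts)
```

Each key of `function_table` corresponds to one row of the function table, and each value holds the summed codon frequencies of the genes annotated by that function.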
This command produces a table of codon frequencies from coding sequences in multiple genomes. A column is added at the beginning of the table for genome name.
The following tables show the options to get the requested results.
|Codon absolute frequencies|
|Codon relative frequencies||
|Synonymous (per-amino acid) codon relative frequencies||
|Amino acid frequencies||
|Amino acid relative frequencies||
|Summed frequencies across genes||
|Synonymous relative summed frequencies across genes||
|Summed frequencies across genes annotated by each function source||
|Relative summed frequencies across genes with KOfam annotations||
|Average frequencies across all genes||
|All function annotation sources||
|All KEGG BRITE categories||
|All KEGG KOfams and all Pfams||
|Certain KEGG BRITE categories||
|Certain KEGG KOfam accessions||
|Certain BRITE categories and KOfam accessions||
|From contigs database||
|From collection of internal genomes||
|From internal genome||
|From internal genomes listed in a file||
|From external genomes (contigs databases) listed in a file||
|With certain gene IDs||
|With certain gene IDs or genes annotated with certain KOfams||
|Exclude genes shorter than 300 codons||
|Exclude genes shorter than 300 codons from contributing to function codon frequencies||
|Exclude functions with <300 codons||
|Exclude stop codons and single-codon amino acids||
|Only include certain codons||
|Exclude codons for amino acids with <5 codons in >90% of genes||
|Replace codons for amino acids with <5 codons in the gene or function with NaN||
The --synonymous flag returns the relative frequency of each codon among the codons encoding the same amino acid, e.g., 0.4 GCC and 0.6 GCT for Ala. By default, stop codons and single-codon amino acids (Met ATG and Trp TGG) in the standard translation table are excluded, equivalent to using --exclude-amino-acids STP Met Trp for other frequency statistics.
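The synonymous calculation can be sketched as follows. This is a minimal illustration rather than the program's own code: the codon map is a small subset of the standard translation table, and the function name `synonymous_relative` and the gene counts are hypothetical.

```python
# Partial codon-to-amino-acid map from the standard translation table.
CODON_TO_AA = {
    "GCC": "Ala", "GCT": "Ala",
    "AAT": "Asn", "AAC": "Asn",
    "ATG": "Met", "TGG": "Trp", "TAA": "STP",
}

# Stop codons and single-codon amino acids are left out of the result.
EXCLUDED = {"STP", "Met", "Trp"}


def synonymous_relative(counts):
    """Return each codon's frequency relative to all codons for the same amino acid."""
    aa_totals = {}
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        aa_totals[aa] = aa_totals.get(aa, 0) + n

    result = {}
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        if aa in EXCLUDED or aa_totals[aa] == 0:
            continue
        result[codon] = n / aa_totals[aa]
    return result


# Hypothetical codon counts for a single gene.
gene_counts = {"GCC": 4, "GCT": 6, "AAT": 1, "AAC": 1, "ATG": 3, "TGG": 2, "TAA": 1}
freqs = synonymous_relative(gene_counts)
print(freqs)
```

With these counts, the two Ala codons split as in the example above (4 of 10 and 6 of 10), and ATG, TGG, and TAA do not appear in the output.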
--sum and --average produce a table with a single row of frequencies from across genes. For example, the following command sums the codon frequencies of each decoded amino acid (and STP) across all genes, and then calculates the relative frequencies of the amino acids.
anvi-get-codon-frequencies -c contigs-db \
                           -o path/to/output_table.txt \
                           --sum \
                           --amino-acid \
                           --relative
The first column of the output table has the header, ‘gene_caller_ids’, and the value, ‘all’, indicating that the data is aggregated across genes.
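The order of operations in that command (sum across genes, collapse codons to amino acids, then normalize) can be sketched in plain Python. The gene data, the partial codon map, and the function name `summed_amino_acid_relative` are assumptions made for illustration only.

```python
# Partial codon-to-amino-acid map from the standard translation table.
CODON_TO_AA = {"GCC": "Ala", "GCT": "Ala", "AAT": "Asn", "AAC": "Asn", "TAA": "STP"}

# Hypothetical codon counts keyed by gene caller ID.
genes = {
    1: {"GCC": 4, "GCT": 6, "AAT": 2, "TAA": 1},
    2: {"GCC": 1, "GCT": 3, "AAC": 2, "TAA": 1},
}


def summed_amino_acid_relative(genes):
    """Sum codon counts per amino acid across genes, then convert to relative frequencies."""
    aa_counts = {}
    for counts in genes.values():
        for codon, n in counts.items():
            aa = CODON_TO_AA[codon]
            aa_counts[aa] = aa_counts.get(aa, 0) + n
    total = sum(aa_counts.values())
    return {aa: n / total for aa, n in aa_counts.items()}


# A single output row, labeled 'all' because it aggregates across genes.
row = {"gene_caller_ids": "all", **summed_amino_acid_relative(genes)}
print(row)
```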
--sum and --average operate on genes. When used with a function option, the program subsets the genes annotated by the functions of interest. With --average, it calculates the average frequency across genes rather than functions (sums of genes with functional annotation). For example, the following command calculates the average synonymous relative frequency across genes annotated by KOfam.
anvi-get-codon-frequencies -c contigs-db \
                           -o path/to/output_table.txt \
                           --average \
                           --synonymous \
                           --function-sources KOfam
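The idea of subsetting genes by annotation source and then averaging across the selected genes can be sketched as below. For brevity the sketch averages absolute codon counts rather than synonymous relative frequencies. The gene records and the function name `average_over_annotated` are hypothetical.

```python
# Hypothetical genes with their annotation sources and codon counts.
genes = {
    1: {"sources": {"KOfam", "Pfam"}, "counts": {"GCC": 4, "GCT": 6}},
    2: {"sources": {"KOfam"}, "counts": {"GCC": 2, "GCT": 2}},
    3: {"sources": {"Pfam"}, "counts": {"GCC": 9, "GCT": 1}},
}


def average_over_annotated(genes, source):
    """Average codon counts across the genes annotated by the given source."""
    selected = [g["counts"] for g in genes.values() if source in g["sources"]]
    if not selected:
        return {}
    codons = sorted({codon for counts in selected for codon in counts})
    return {
        codon: sum(counts.get(codon, 0) for counts in selected) / len(selected)
        for codon in codons
    }


# Gene 3 has no KOfam annotation, so it does not contribute to the average.
average = average_over_annotated(genes, "KOfam")
print(average)
```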
Functions and function annotation sources can be provided to subset genes (as seen in the last section with --sum and --average) and to calculate statistics for functions in addition to genes (as seen in a previous example).
--output-file is equivalent to --gene-table-output rather than --function-table-output, producing rows containing frequencies for annotated genes rather than summed frequencies for functions.
There are multiple options to define which functions and sources should be used.
--function-sources without arguments uses all available sources that had been used to annotate genes.
--function-accessions and --function-names select functions from a single provided source. The following example uses both options to select COG functions.
anvi-get-codon-frequencies -c contigs-db \
                           -o path/to/output_table.txt \
                           --function-sources COG14_FUNCTION \
                           --function-accessions COG0004 COG0005 \
                           --function-names "Ammonia channel protein AmtB" "Purine nucleoside phosphorylase"
To use different functions from different sources, a tab-delimited functions-txt file can be provided. This headerless file must have three columns, for the source, accession, and name of each function, respectively, and every row must have an entry for source.
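The file layout described above can be checked with a short Python sketch. It assumes only what the text states (no header, three tab-separated columns, source always filled in). Whether accession or name may be left blank is an assumption here, and the reader function `read_functions_txt` is a hypothetical helper, not part of anvi'o.

```python
import csv
import io

# Example content reusing the COG functions from the command above.
example = (
    "COG14_FUNCTION\tCOG0004\tAmmonia channel protein AmtB\n"
    "COG14_FUNCTION\tCOG0005\tPurine nucleoside phosphorylase\n"
)


def read_functions_txt(handle):
    """Read headerless (source, accession, name) rows and require a source in each."""
    selections = []
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if len(row) != 3:
            raise ValueError(f"line {line_number}: expected 3 columns, found {len(row)}")
        source, accession, name = (field.strip() for field in row)
        if not source:
            raise ValueError(f"line {line_number}: the source column is empty")
        selections.append((source, accession, name))
    return selections


selections = read_functions_txt(io.StringIO(example))
print(selections)
```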
By default, selected function accessions or names do not need to be present in the input genomes; the program will return data for any selected function accessions or names that annotate genes. This behavior can be changed using the flag, --expect-functions, so that the program will throw an error when any of the selected accessions or names are absent.
Genes are classified in KEGG BRITE functional hierarchies by anvi-run-kegg-kofams. For example, a bacterial SSU ribosomal protein is classified in a hierarchy of ribosomal genes, Ribosome>>>Ribosomal proteins>>>Bacteria>>>Small subunit. Codon frequencies can be calculated for genes classified at each level of the hierarchy, from the most general (those genes in Ribosome) to the most specific (in the example, those genes in Ribosome>>>Ribosomal proteins>>>Bacteria>>>Small subunit). Therefore, the following command returns summed codon frequencies for each annotated hierarchy level. In the example, the output would include four rows, one for the genes in each level from Ribosome to Ribosome>>>Ribosomal proteins>>>Bacteria>>>Small subunit.
anvi-get-codon-frequencies -c contigs-db \
                           -o path/to/output_table.txt \
                           --function-sources KEGG_BRITE
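To make the four hierarchy levels concrete, the following sketch expands a BRITE classification string into each of its levels. It only illustrates the naming scheme described above. The helper name `hierarchy_levels` is hypothetical.

```python
def hierarchy_levels(category, separator=">>>"):
    """Return every level of a hierarchy string, from most general to most specific."""
    parts = category.split(separator)
    return [separator.join(parts[:i]) for i in range(1, len(parts) + 1)]


levels = hierarchy_levels("Ribosome>>>Ribosomal proteins>>>Bacteria>>>Small subunit")
for level in levels:
    print(level)
```

A gene classified at the most specific level would contribute to the frequencies of all four levels printed here.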
It may be useful to restrict codons in the analysis to those encoding certain amino acids. Stop codons and the single codons encoding Met and Trp are excluded by default from calculation of synonymous codon relative frequencies (--synonymous). Relative frequencies across codons in a gene (--relative) are calculated for the selected amino acids, so the following option would return a table of codon frequencies relative to the codons encoding the selected nonpolar amino acids: --include-amino-acids Gly Ala Val Leu Met Ile.
Dynamic exclusion of amino acids can be useful in the calculation of synonymous codon frequencies. For example, 0.5 AAT and 0.5 AAC for Asn may be statistically insignificant for a gene with 1 AAT and 1 AAC; even more meaningless would be 1.0 AAT and 0.0 AAC for a gene with 1 AAT and 0 AAC.
--pansequence-min-amino-acids removes rarer amino acids across the dataset, setting a minimum number of codons in a minimum number of genes to retain the amino acid. For example, amino acids with <5 codons in >90% of genes will be excluded from the analysis with --pansequence-min-amino-acids 5 0.9.
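The dataset-wide exclusion rule can be sketched as follows, reading the example literally: an amino acid is dropped when the fraction of genes with fewer than the minimum number of its codons exceeds the given threshold. This is an interpretation of the text, not the program's code. The gene data and the function name `pansequence_excluded` are hypothetical.

```python
def pansequence_excluded(aa_counts_per_gene, min_codons=5, max_fraction=0.9):
    """Return amino acids with < min_codons codons in > max_fraction of genes."""
    n_genes = len(aa_counts_per_gene)
    if n_genes == 0:
        return set()
    amino_acids = {aa for counts in aa_counts_per_gene for aa in counts}
    excluded = set()
    for aa in amino_acids:
        n_sparse = sum(1 for counts in aa_counts_per_gene if counts.get(aa, 0) < min_codons)
        if n_sparse / n_genes > max_fraction:
            excluded.add(aa)
    return excluded


# Ten hypothetical genes, each with the number of codons per amino acid.
# Cys is rare in every gene; His is rare in only half of them.
genes = [{"Ala": 8, "Cys": 1, "His": 2 if i < 5 else 6} for i in range(10)]

excluded = pansequence_excluded(genes, min_codons=5, max_fraction=0.9)
print(excluded)
```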
Codons for rarer amino acids within each gene or function row can be excluded in the results table (replaced by NaN) with --sequence-min-amino-acids. This parameter only affects how the results are displayed. For example, amino acids with <5 codons in a given row will be replaced by NaN in the results table with --sequence-min-amino-acids 5.
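The per-row masking can be sketched as below. The sketch assumes a threshold of 5 codons per amino acid and uses a partial codon map. The function name `mask_rare_amino_acids` and the counts are hypothetical and not taken from anvi'o.

```python
import math

# Partial codon-to-amino-acid map from the standard translation table.
CODON_TO_AA = {"GCC": "Ala", "GCT": "Ala", "AAT": "Asn", "AAC": "Asn"}


def mask_rare_amino_acids(codon_counts, min_codons=5):
    """Replace values with NaN for codons whose amino acid has < min_codons codons in the row."""
    aa_totals = {}
    for codon, n in codon_counts.items():
        aa = CODON_TO_AA[codon]
        aa_totals[aa] = aa_totals.get(aa, 0) + n
    return {
        codon: float(n) if aa_totals[CODON_TO_AA[codon]] >= min_codons else math.nan
        for codon, n in codon_counts.items()
    }


# A hypothetical row with 10 Ala codons but only 2 Asn codons.
masked = mask_rare_amino_acids({"GCC": 4, "GCT": 6, "AAT": 1, "AAC": 1}, min_codons=5)
print(masked)
```

Only the displayed values change; the Ala codons keep their counts while both Asn codons become NaN.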
Removal of genes with few codons can improve the statistical utility of relative frequencies. --gene-min-codons sets the minimum number of codons required in a gene, and this filter can be applied before and/or after the removal of rarer codons. Applied before, --gene-min-codons filters genes by length; applied after, it filters genes by codons remaining after removing rarer codons.
--min-codon-filter can take three possible arguments: length, remaining, or, by default when codons are removed, both, which applies the --gene-min-codons filter both before and after codon removal.
It may seem redundant for remaining and both to both be possibilities, but this is due to the possibility of dynamic amino acid exclusion using --pansequence-min-amino-acids. Amino acids are removed based on their frequency in a proportion of genes, so removing shorter genes by length before removing amino acids can affect which amino acids are dynamically excluded.
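The interaction between the gene length filter and dynamic amino acid exclusion can be sketched as below. This is a simplified model under stated assumptions, not the program's implementation: genes are represented by codon counts per amino acid, the mode names "length", "remaining", and "both" follow the description above, and the function names and data are hypothetical.

```python
def dynamically_excluded(genes, min_aa_codons, max_fraction):
    """Amino acids with < min_aa_codons codons in > max_fraction of the given genes."""
    if not genes:
        return set()
    amino_acids = {aa for counts in genes.values() for aa in counts}
    return {
        aa
        for aa in amino_acids
        if sum(1 for c in genes.values() if c.get(aa, 0) < min_aa_codons) / len(genes)
        > max_fraction
    }


def apply_filters(genes, gene_min_codons, mode, min_aa_codons, max_fraction):
    """Filter genes before and/or after dynamic amino acid exclusion."""
    if mode not in {"length", "remaining", "both"}:
        raise ValueError(f"unknown mode: {mode}")
    if mode in ("length", "both"):
        # Filter by gene length before any amino acids are removed.
        genes = {g: c for g, c in genes.items() if sum(c.values()) >= gene_min_codons}
    excluded = dynamically_excluded(genes, min_aa_codons, max_fraction)
    trimmed = {
        g: {aa: n for aa, n in c.items() if aa not in excluded} for g, c in genes.items()
    }
    if mode in ("remaining", "both"):
        # Filter by the codons left after amino acid removal.
        trimmed = {g: c for g, c in trimmed.items() if sum(c.values()) >= gene_min_codons}
    return trimmed, excluded


# Hypothetical codon counts per amino acid: one long gene and two short ones.
genes = {
    "g1": {"Ala": 10, "Cys": 6},
    "g2": {"Ala": 6, "Cys": 1},
    "g3": {"Ala": 7, "Cys": 2},
}

remaining_result, remaining_excluded = apply_filters(genes, 10, "remaining", 5, 0.5)
both_result, both_excluded = apply_filters(genes, 10, "both", 5, 0.5)
print(remaining_result, remaining_excluded)
print(both_result, both_excluded)
```

With "remaining", the short genes still take part in deciding which amino acids are rare, so Cys is excluded before they are dropped. With "both", the short genes are removed first, and Cys is retained for the one surviving gene.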
--function-min-codons sets the minimum number of codons required in a function. Function codon count filters occur after gene codon count filters: the set of genes contributing to function codon frequency can be restricted by applying --gene-min-codons.