anvi-display-contigs-stats

Authors
Can consume
Can provide
Usage

Working with single or multiple contigs databases
Interactive output

How do we predict the number of genomes?

Text output

Additional Resources

Start the anvi'o interactive interface for viewing or comparing contigs statistics..

🔙 To the main page of anvi’o programs and artifacts.

Authors

Özcan C. Esen

Florian Trigodet

Andrea Watson

Can consume

contigs-db

Can provide

contigs-stats interactive svg

Usage

This program helps you make sense of contigs in one or more contigs-db files, summarizing contig lengths, gene lengths, HMM hits, and single-copy core gene (SCG) based genome estimates.

Working with single or multiple contigs databases

You can use this program on a single contigs database the following way:

anvi-display-contigs-stats A.db

Alternatively, you may use it to compare multiple contigs databases:

anvi-display-contigs-stats A.db \ B.db \ (…) X.db

If you are comparing multiple, each contigs databse will become an individual column in all outputs (columns are labeled with the project_name stored in each contigs-db).

Interactive output

If you run this program on an anvi’o contigs database with default parameters,

anvi-display-contigs-stats contigs-db

it will open an interactive interface that displays all the summary data. Here is an example with a few contigs-db files and the output it displays:

anvi-display-contigs-stats A.db \ B.db \ C.db \ D.db \ H.db \ I.db

An example of the anvi'o interface for contigs stats

At the top of the page are two graphs:

The bars in the top graph represent every integer N and L statistic from 1 to 100. The y-axis is the respective N length and the x-axis is the percentage of the total dataset looked at (the exact L and N values can be seen by hovering over each bar). In other words, if you had sorted your contigs by length (from longest to shortest), and walked through each one, every time you had seen another 1 percent of your total dataset, you would add a bar to the graph showing the number of contigs that you had seen (the L statistic) and the length of the one you were looking at at the moment (the N statistic).
The lower part of the graph tells you about which HMM hits your contigs database has. Each column is a gene in a specific hmm-source, and the graph tells you how many hits each gene has in your data. (Hover your mouse over the graph to see the specifics of each gene.) The sidebar shows you how many of the genes in this graph were seen exactly that many times. The distribution of SCGs offers an estimate about the number of genomes that are present in this sample that made it all the way to the end of the assembly. Based on this information, and given the data shown in the figure above, one should expect to recover around 1,500 bacterial genomes, 30 archaeal genomes, and no eukaryotic genome from the assembly of sample A. See the section about predicting number of genomes for more details.

Below the graphs are the contigs stats, reported in this order:

Basic contig stats
- Total Length (nucleotides)
- Num Contigs
- Num Contigs > 100 kb / 50 kb / 20 kb / 10 kb / 5 kb / 2.5 kb
- Longest Contig / Shortest Contig (nucleotides)
- Mean Contig Length (trim 10%): two-sided 10% trimmed mean of contig lengths to down-weight extremes
- L50 / L75 / L90: number of contigs required to cover 50/75/90% of the assembly when sorted by length
- N50 / N75 / N90: contig length at the 50/75/90% mark of the assembly when sorted by length
Gene stats (using the default gene caller recorded in the contigs-db, e.g., Prodigal or pyrodigal-gv)
- Num Genes
- Avg Gene Length
- Avg Gene Length (trim 10%): two-sided 10% trimmed mean of gene lengths
- Min Gene Length / Max Gene Length
HMM hits
- For every hmm-source found in the contigs-db (including SCG sets and other sources such as Ribosomal_RNAs), the total number of hits; n/a is shown for contigs databases where a source is missing.
SCG-based estimates for number of populations
- For each hmm-source used for SCGs (for example Bacteria_71, Archaea_76, Protista_83), the predicted number of genomes. See predicting number of genomes.

How do we predict the number of genomes?

Our estimates rely on the following assumption: most microbial genomes contain exactly one copy of each single-copy core gene (SCG) in the set of known SCGs for their taxonomic domain.

Under this assumption, the number of HMM hits to a known SCG in a metagenome can serve as an estimate of the number of genomes (of a given taxonomic domain) in that sample. For instance, if you see 5 hits to a bacterial ‘Ribosomal L1’ gene, you can guess that there are at least 5 bacteria in your sample. Of course, it doesn’t make sense to use the number of hits to just one SCG when there are many different ones. So what we do instead is to compute the statistical mode of the number of hits to each SCG in a given hmm-source. This is what the lower graph of HMM hits is showing at the top of the interactive page – the sideways histogram on the left side indicates the frequency of each total, so the longest bar there is the mode.

Why the mode? Consider that there are various potential issues with annotating SCGs that could throw off your estimates: you could be randomly missing some SCG hits due to incomplete sequencing of some populations in your sample, or systematically missing some SCG hits due to annotation models that don’t capture all the diversity within your sample, or finding extra SCG hits due to a technical artifact that splits the annotation of a single gene into two different hits covering separate portions of the gene (we see this sometimes with multi-domain SCGs). In addition, there can be real biological duplications or losses of SCGs that violate the principle of the ‘single-copy core’ label. We compute the mode across all SCGs to make our estimates more robust to these fluctuations (we chose to use the mode rather than the average because averages are influenced by outliers).

The SCGs for each domain can be annotated in a contigs-db with anvi-run-hmms, which utilizes three domain-specific SCG sets: the hmm-sources Bacteria_71, Archaea_76, and Protista_83. So what we do is get an estimate for each of these domains, which are the predicted number of genomes that you see in the stats table:

An example of the domain-specific predictions for number of genomes

These domain-specific predictions can be added to get an estimate of the ‘Total’ number of genomes in the sample.

There is one additional caveat to mention. Sometimes, there is sufficient sequence similarity between single-copy core genes of different domains for a gene to be annotated with HMMs from more than one domain (e.g., a bacterial model and an archaeal model). When this happens, we don’t include the affected SCG model in our calculation of the mode. This sometimes leads to differences between the mode shown on the SCG hit graphs and the estimates in the stats table. If this happens in your data, you will see a warning like this on your terminal:

WARNING
===============================================
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source()
function. Just so you know, someone asked for SCG HMMs that belong to multiple
sources *not* to be counted, and this will result in 70 models to be removed
from our counts, more specifically: 29 from Bacteria_71, 31 from Archaea_76, 10
from Protista_83. You can run this program with the `--debug` flag if you want
to see a list of the models that we will ignore from each HMM source.

We implemented the above strategy to avoid double-counting certain genomes. You can see a discussion of this particular issue here.

If you want some additional context, this method was originally described in this blog post before being implemented as part of this program. Also, please note that you can obtain these estimates from any contigs-db programmatically with the anvi’o Python library as described here.

Text output

If you wish to report contigs-db stats as a supplementary table, a text output will be much more appropriate. If you add the flag --report-as-text anvi’o will not attempt to initiate an interactive interface, and instead will report the stats as a TAB-delmited file. The table contains the rows listed above (basic contig stats, gene stats, HMM hits, and SCG-based genome estimates) in the same order:

anvi-display-contigs-stats contigs-db \ --report-as-text \ -o OUTPUT_FILE_NAME.txt

There is also another flag you can add to get the output formatted as markdown (--as-markdown), which makes it easier to copy-paste to GitHub or other markdown-friendly services. This is how you get a markdown output instead:

anvi-display-contigs-stats contigs-db \ --report-as-text \ --as-markdown \ -o OUTPUT_FILE_NAME.md

Here is an example excerpt from a markdown-formatted report (numbers are from a single run and shown only to illustrate structure):

	`A`	`B`	`C`	`D`	`H`	`I`
Total Length	5747428710	4481865631	7234715430	4279426713	5129708529	5538888058
Num Contigs	187938	174549	231236	130052	201889	207768
Num Contigs > 100 kb	5716	3330	6938	4605	3385	4328
Num Contigs > 50 kb	19157	12848	25123	14142	13723	15809
Num Contigs > 20 kb	93053	75234	125814	61939	87667	90509
Num Contigs > 10 kb	171706	154977	215106	117645	181195	183506
Num Contigs > 5 kb	187448	173850	230546	129702	200903	206274
Num Contigs > 2.5 kb	187938	174549	231234	130051	201888	207768
Longest Contig	4098864	2443875	4239191	5922001	4180701	4037354
Shortest Contig	3040	2548	1776	1217	1092	2829
Mean Contig Length (trim 10%)	22166.90	19993.67	23463.96	21949.06	20017.19	20193.47
L50	33401	37525	44702	18318	45228	41620
L75	87499	87145	111234	55335	103190	101106
L90	136328	130024	169718	91311	151730	152691
N50	36060	28646	36088	42550	27609	30066
N75	20791	18308	21621	21516	18252	18744
N90	14773	13176	15672	14631	13503	13548
Num Genes	6610990	4825470	7354355	4716548	5706502	6338677
Avg Gene Length	799.52	854.62	880.09	838.65	817.23	798.82
Avg Gene Length (trim 10%)	715.05	759.31	779.88	753.48	711.29	704.34
Min Gene Length	60	60	60	60	60	60
Max Gene Length	58464	68517	72867	66381	76293	52899
Archaea_76	119701	87059	108961	84365	94398	107056
Bacteria_71	243439	170803	190890	173842	165695	210866
HMM_DNApolB	1058	579	824	692	1897	1663
PFAM_Ribonucleotide_reductase_barrel_domain	5196	3775	4536	3540	5560	5583
Protista_83	11617	8939	13233	7687	10098	10089
RNA_Polymerase_Type_A	3739	2571	n/a	n/a	n/a	n/a
Ribosomal_RNA_16S	3285	2382	2839	2458	2110	2571
Ribosomal_RNA_18S	47	21	28	17	31	21
bacteria (Bacteria_71)	1584	1062	2636	1263	1321	1374
archaea (Archaea_76)	28	63	294	0	215	69
eukarya (Protista_83)	0	2	0	1	6	4

You can easily convert the markdown output into PDF or HTML pages using pandoc. For instance running the following command in the previous output,

pandoc -V geometry:landscape \
       OUTPUT_FILE_NAME.md \
       -o OUTPUT_FILE_NAME.pdf

will results in a PDF file that looks like this:

an anvi'o display

Edit this file to update this information.

Additional Resources

Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__ tag in this file to see an example.