Run KOfam HMMs on an anvi'o contigs database.
🔙 To the main page of anvi’o programs and artifacts.
Essentially, this program uses the KEGG database to annotate functions and metabolic pathways in a contigs-db. More specifically, anvi-run-kegg-kofams annotates a contigs-db with HMM hits from KOfam, a database of KEGG Orthologs (KOs). You must set up these HMMs on your computer using anvi-setup-kegg-data before you can use this program. If a modules-db is available, membership of KOfam functions in KEGG metabolic MODULES and BRITE hierarchies is also stored in the contigs-db.
Running this program is a pre-requisite for metabolism estimation with anvi-estimate-metabolism. Note that if you are planning to run metabolism estimation, it must be run with the same kegg-data that is used in this program to annotate KOfam hits.
1) Run an HMM search against KOfam Briefly, what this program does is extract all the gene calls from the contigs-db and checks each one for hits to the KOfam HMM profiles in your kegg-data. This can be time-consuming given that the number of HMM profiles is quite large, even more so if the number of genes in the contigs-db is also large. Multi-threading is a good idea if you have the computational capability to do so.
2) Eliminate weak hits based on bitscore Many HMM hits will be found, most of them weak. The weak hits will by default be eliminated according to the bitscore thresholds provided by KEGG; that is, hits with bitscores below the threshold for a given KO profile will be discarded, and those with bitscores above the threshold will be annotated in the contigs-db. It is perfectly normal to notice that the number of raw hits found is many, many times larger than the number of annotated KO hits in your database.
3) Add back valid hits that were missed There is one issue with this practice of removing all KOfam hits below the KEGG bitscore threshold for a given profile. We (and others) have noticed that the KEGG thresholds can sometimes be too stringent, eliminating hits that are actually valid annotations. To solve this problem, we have implemented the following heuristic for relaxing the bitscore thresholds and annotating genes that would otherwise go without a valid KO annotation:
For every gene without a KOfam annotation, we examine all the hits with an e-value below X
and a bitscore above Y
percent of the threshold. If those hits are all to a unique KOfam profile, then we annotate the gene call with that KO.
X
and Y
are parameters that can be modified (see below), but by default the e-value threshold (X
) is 1e-05 and the bitscore fraction (Y
) is 0.5.
Please note that this strategy is just a heuristic. We have tried to pick default parameters that seemed reasonable but by no means have we comprehensively tested and optimized them. This is why X and Y are mutable so that you can explore different values and see how they work for your data. It is always a good idea to double-check your annotations to make sure they are reasonable and as stringent as you’d like them to be. In addition, if you do not feel comfortable using this heuristic at all, you can always turn this behavior off and rely solely on KEGG’s bitscore thresholds. :)
3) Put annotations in the database
In the contigs-db functions table, annotated KO hits (kegg-functions) will have the source KOfam
. If a modules-db is available, metabolic modules and BRITE functional classifications containing these functions also have entries in the table, with sources labeled KEGG_Module
and KEGG_BRITE
. BRITE classification will not occur if anvi-setup-kegg-data was not set up with BRITE data (see the artifact for that program to see how to include BRITE).
anvi-run-kegg-kofams -c contigs-db
If you have previously setup your KEGG data directory using --kegg-data-dir
(see anvi-setup-kegg-data), or have moved the KEGG data directory that you wish to use to a non-default location (maybe you like keeping the older versions around when you update, we don’t know how you roll), then you may need to specify where to find the KEGG data so that this program can use the right one. In that case, this is how you do it:
anvi-run-kegg-kofams -c contigs-db \ --kegg-data-dir /path/to/directory/KEGG
anvi-run-kegg-kofams -c contigs-db -T 4
By default, anvi-run-kegg-kofams uses hmmsearch
to find KO hits. If for some reason you would rather use a different program (hmmscan
is also currently supported), you can do so.
anvi-run-kegg-kofams -c contigs-db \ --hmmer-program hmmscan
Usually, this program parses out weak HMM hits and keeps only those that are above the score threshold for a given KO. If you would like to turn off this behavior and keep all hits (there will be a lot of weak ones), you can follow the example below:
anvi-run-kegg-kofams -c contigs-db \ --keep-all-hits
If you want to see the bitscores of all KOfam hits that were added to your contigs database, you can use the --log-bitscores
option to save these values into a tab-delimited file:
anvi-run-kegg-kofams -c contigs-db \ --log-bitscores
Here is an example of what the resulting bitscore file would look like:
entry_id | bit_score | domain_bit_score | e_value | entry_id | gene_callers_id | gene_hmm_id | gene_name |
---|---|---|---|---|---|---|---|
1 | 177.4 | 85.1 | 8e-54 | 0 | 1371 | - | K10681 |
2 | 34.1 | 33.7 | 9.1e-11 | 1 | 1141 | - | K01954 |
3 | 22.4 | 22.4 | 3.1e-07 | 2 | 1402 | - | K01954 |
4 | 12.8 | 11.8 | 0.00024 | 3 | 1099 | - | K01954 |
5 | 17.1 | 16.7 | 4.4e-05 | 4 | 1267 | - | K20024 |
Combining this flag with the --keep-all-hits
option is one way to get the bitscores of all matches to the KOfam profiles, even the ones that would usually not pass the bitscore threshold provided by KEGG.
As described above, this program does its best to avoid missing valid annotations by relaxing the bitscore threshold for genes without any annotations. For such a gene, hits with e-value <= X and bitscore > (Y * KEGG threshold) that are all hits to the same KOfam profile are used to annotate the gene with that KO.
If you don’t want any previously-eliminated hits to be used for annotation, you can skip this heuristic by using the flag --skip-bitscore-heuristic
. Then, only hits with bitscores above the KEGG-provided threshold for a given KO will be used for annotation.
anvi-run-kegg-kofams -c contigs-db \ --skip-bitscore-heuristic
If our default values are too stringent or not stringent enough for your tastes, you can change them! The e-value threshold (X, default: 1e-05) can be set using -E
or --heuristic-e-value
and the bitscore fraction (Y, default: 0.50) can be set using -H
or --heuristic-bitscore-fraction
. Like so:
anvi-run-kegg-kofams -c contigs-db \ -E 1e-15 \ -H 0.90
If for some strange reason you do not want KEGG BRITE annotations to be added to your contigs database, you can skip them by providing the --skip-brite-hierarchies
flag:
anvi-run-kegg-kofams -c contigs-db \ --skip-brite-hierarchies
Edit this file to update this information.
Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__
tag in this file to see an example.