modules-db [artifact]

DB

A DB-type anvi’o artifact. This artifact is typically generated, used, and/or exported by anvi’o (and not provided by the user).

Provided by

anvi-setup-kegg-kofams

Required or used by

anvi-migrate

Description

A database containing information from the KEGG MODULE database for use in metabolic reconstruction and functional annotation of KEGG Orthologs (KOs).

This database is part of the kegg-data directory. You can get it on your computer by running anvi-setup-kegg-kofams. Programs that rely on this database include anvi-run-kegg-kofams and anvi-estimate-metabolism.

Most users will never have to interact directly with this database. However, for the brave few who want to try this (or who are figuring out how anvi’o works under the hood), there is some relevant information below.

Database Contents

The kegg_modules table

In the current implementation, data about each metabolic pathway from the KEGG MODULE database is present in the kegg_modules table, which looks like this:

module | data_name | data_value | data_definition | line
M00001 | ENTRY | M00001 | Pathway | 1
M00001 | NAME | Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate | NULL | 2
M00001 | DEFINITION | (K00844,K12407,K00845,K00886,K08074,K00918) (K01810,K06859,K13810,K15916) (K00850,K16370,K21071,K00918) (K01623,K01624,K11645,K16305,K16306) K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406) | NULL | 3
M00001 | ORTHOLOGY | K00844 | hexokinase/glucokinase [EC:2.7.1.1 2.7.1.2] [RN:R01786] | 4
M00001 | ORTHOLOGY | K12407 | hexokinase/glucokinase [EC:2.7.1.1 2.7.1.2] [RN:R01786] | 4
(…) | (…) | (…) | (…) | (…)

These data correspond to the information that can be found on the KEGG website for each metabolic module - for an example, you can see the page for M00001 (or, alternatively, its flat text file version from the KEGG REST API).

The module column indicates the module ID number while the data_name column indicates what type of data the row is describing about the module. These data names are usually fairly self-explanatory - for instance, the DEFINITION rows describe the module definition and the ORTHOLOGY rows describe the KEGG Orthologs (KOs) belonging to the module - however, for an official explanation, you can check the KEGG help page.

The data_value and data_definition columns hold the information corresponding to the row’s data_name; for ORTHOLOGY fields these are the KO number and the KO’s functional annotation, respectively. Not all rows have a data_definition field.

Finally, some rows of data originate from the same line in the original KEGG MODULE text file; these rows will have the same number in the line column. Perhaps this is a useless field. But it is there.
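
As a minimal sketch of how these columns fit together, the Python snippet below builds an in-memory SQLite table with the same column names as the example table and pulls out the KOs listed for one module. The column types, the helper name `kos_for_module`, and the in-memory stand-in are assumptions made for illustration; against a real database you would open the MODULES.db file instead.

```python
import sqlite3

# In-memory stand-in for MODULES.db; the column types are assumed, not taken from anvi'o.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kegg_modules "
    "(module TEXT, data_name TEXT, data_value TEXT, data_definition TEXT, line INTEGER)"
)
rows = [
    ("M00001", "ENTRY", "M00001", "Pathway", 1),
    ("M00001", "NAME", "Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate", None, 2),
    ("M00001", "ORTHOLOGY", "K00844", "hexokinase/glucokinase [EC:2.7.1.1 2.7.1.2] [RN:R01786]", 4),
    ("M00001", "ORTHOLOGY", "K12407", "hexokinase/glucokinase [EC:2.7.1.1 2.7.1.2] [RN:R01786]", 4),
]
conn.executemany("INSERT INTO kegg_modules VALUES (?, ?, ?, ?, ?)", rows)


def kos_for_module(connection, module_id):
    """Return (KO number, annotation, line) for each ORTHOLOGY row of a module (hypothetical helper)."""
    query = (
        "SELECT data_value, data_definition, line FROM kegg_modules "
        "WHERE module = ? AND data_name = 'ORTHOLOGY' ORDER BY data_value"
    )
    return connection.execute(query, (module_id,)).fetchall()


orthologs = kos_for_module(conn, "M00001")
for ko, annotation, line in orthologs:
    print(ko, annotation, line, sep="\t")
```

Both ORTHOLOGY rows in this sample share the same value in the line column, which reflects that they came from the same line of the original KEGG MODULE text file.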

The database hash value

In the self table of this database, there is an entry called hash. This string is a hash of the contents of the database, and it allows us to identify the version of the data within the database. This value is important for ensuring that the same MODULES.db is used both for annotating a contigs database with anvi-run-kegg-kofams and for estimating metabolism on that contigs database with anvi-estimate-metabolism.

You can easily check the hash value by running the following:

anvi-db-info modules-db

It will appear in the DB Info section of the output, like so:

DB Info (no touch also)
===============================================
num_modules ..................................: 443
total_entries ................................: 13720
creation_date ................................: 1608740335.30248
hash .........................................: 45b7cc2e4fdc

If you have annotated a contigs-db using anvi-run-kegg-kofams, the corresponding hash in that contigs database will match this one:

anvi-db-info contigs-db

DB Info (no touch also)
===============================================
[....]
modules_db_hash ..............................: 45b7cc2e4fdc
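
The same comparison can be sketched in Python. This example assumes that the self table stores its entries in `key` and `value` columns, which is not stated on this page, and it uses two in-memory databases as stand-ins for a modules-db and a contigs-db. The helpers `make_self_table` and `read_self_value` are hypothetical and exist only for this illustration.

```python
import sqlite3


def make_self_table(entries):
    """Create an in-memory database with a 'self' table (key/value layout is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE self (key TEXT, value TEXT)")
    conn.executemany("INSERT INTO self VALUES (?, ?)", list(entries.items()))
    return conn


def read_self_value(conn, key):
    """Return the value stored under `key` in the self table, or None if absent."""
    row = conn.execute("SELECT value FROM self WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


# Stand-ins carrying the values shown in the example output above.
modules_db = make_self_table({"num_modules": "443", "hash": "45b7cc2e4fdc"})
contigs_db = make_self_table({"modules_db_hash": "45b7cc2e4fdc"})

hashes_match = read_self_value(modules_db, "hash") == read_self_value(contigs_db, "modules_db_hash")
print("hashes match" if hashes_match else "hashes differ")
```

With real files, you would replace the stand-ins with `sqlite3.connect(...)` calls on the two database paths, provided the assumed table layout holds.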

Querying the database

If you want to extract information directly from this database, you can do it with a bit of SQL :)

Here is one example, which obtains the name of every module in the database:

# learn where the MODULES.db is:
export ANVIO_MODULES_DB="$(python -c "import anvio; import os; print(os.path.join(os.path.dirname(anvio.__file__), 'data/misc/KEGG/MODULES.db'))")"
# get module names as a tab-delimited file (quoting the path guards against spaces in it):
sqlite3 "$ANVIO_MODULES_DB" "select module,data_value from kegg_modules where data_name='NAME'" | \
    tr '|' '\t' > module_names.txt
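
If you would rather stay in Python, a rough equivalent of the query above might look like the sketch below. It runs against a small in-memory table with the same column names so that it is self-contained; the column types and the helper name `write_module_names` are assumptions, and with a real database you would pass the path to MODULES.db to `sqlite3.connect` instead.

```python
import csv
import io
import sqlite3


def write_module_names(conn, handle):
    """Write one 'module<TAB>name' line per NAME row to a file-like object; return the row count."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    count = 0
    for module, name in conn.execute(
        "SELECT module, data_value FROM kegg_modules WHERE data_name = 'NAME' ORDER BY module"
    ):
        writer.writerow([module, name])
        count += 1
    return count


# In-memory stand-in for MODULES.db; the column types are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kegg_modules "
    "(module TEXT, data_name TEXT, data_value TEXT, data_definition TEXT, line INTEGER)"
)
conn.executemany(
    "INSERT INTO kegg_modules VALUES (?, ?, ?, ?, ?)",
    [
        ("M00001", "ENTRY", "M00001", "Pathway", 1),
        ("M00001", "NAME", "Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate", None, 2),
    ],
)

buffer = io.StringIO()
n_written = write_module_names(conn, buffer)
print(buffer.getvalue(), end="")
```

Passing an open file handle (for example from `open("module_names.txt", "w", newline="")`) in place of the `StringIO` buffer would write the same tab-delimited lines to disk.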
