Set up user-defined metabolic pathways into an anvi'o-compatible database.
This program takes as input a directory containing one module file for each user-defined module, formatted in the same way as KEGG modules, and parses these modules into the USER_MODULES.db database. This directory of user-defined data is referred to as user-modules-data, and the help page for that artifact contains a detailed account of how to create your own module definitions and estimate their completeness.
This page will give a few details specific to running anvi-setup-user-modules.
To run this program, you must provide an input directory containing your module definitions:
anvi-setup-user-modules --user-modules /path/to/user/data/directory
This input directory must have a specific format (see section below). The USER_MODULES.db will be generated in this directory, so you can use the same path to provide your data to anvi-estimate-metabolism when you want to estimate completeness for these modules.
The directory you provide to the --user-modules parameter must have another folder inside of it, which must be called modules. Inside that modules folder, you should put text files containing the definitions of your metabolic modules, one file per module. Each file should be named according to the identifier you want the module to have, and should not have a file extension.
Here is an example schematic of a proper input directory:
MY_METABOLISM_DATA_DIR
 |
 |- modules
     |- U00001
     |- U00002
     |- U00003
     |- U00004
Each U0000x file in the schematic above contains a definition for one module. Running anvi-setup-user-modules --user-modules MY_METABOLISM_DATA_DIR will produce a USER_MODULES.db file in the MY_METABOLISM_DATA_DIR folder, which will contain 4 modules named U00001, U00002, U00003, and U00004 (assuming those files are formatted correctly).
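If you want to double-check your directory layout before running the program, the rules above can be sketched in a few lines of Python. This is only an illustration of the layout described on this page, not the validation that anvi'o itself performs, and the function name `list_user_modules` is hypothetical.

```python
from pathlib import Path


def list_user_modules(data_dir):
    """Return the sorted module identifiers found in <data_dir>/modules.

    Hypothetical helper: it only checks the layout described in the text
    (a 'modules' folder holding one extension-less file per module).
    """
    modules_dir = Path(data_dir) / "modules"
    if not modules_dir.is_dir():
        raise ValueError(f"expected a 'modules' folder inside {data_dir}")

    identifiers = []
    for path in sorted(modules_dir.iterdir()):
        # Skip sub-folders and hidden files such as .DS_Store
        if not path.is_file() or path.name.startswith("."):
            continue
        if path.suffix:
            raise ValueError(
                f"module file '{path.name}' should not have an extension"
            )
        # The file name is the identifier the module will get
        identifiers.append(path.name)
    return identifiers
```

Calling `list_user_modules("MY_METABOLISM_DATA_DIR")` on the schematic above would list the four U0000x identifiers.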
We use KEGG’s system for describing metabolic modules, so you will need to format your metabolic pathways in the same way. Here is an example, for a module file called
U00002 (like in the schematic above):
ENTRY       U00002
NAME        Nitrogen fixation (full Nif gene set)
DEFINITION  K02588+K02586+K02591-K00531 K02587 K02592 K02585
ORTHOLOGY   K02588  NifH
            K02586  NifD
            K02591  NifK
            K00531  anfG
            K02587  NifE
            K02592  NifN
            K02585  NifB
CLASS       User modules; Energy metabolism; Nitrogen metabolism
ANNOTATION_SOURCE  K02588  KOfam
                   K02586  KOfam
                   K02591  KOfam
                   K00531  KOfam
                   K02587  KOfam
                   K02592  KOfam
                   K02585  KOfam
///
As you can see, there are different data types in the file, named by the all-capital word at the beginning of the line (we call this the ‘data name’). The second column of the file is the value corresponding to that type of information (‘data value’). Some data names, like ORTHOLOGY and ANNOTATION_SOURCE, also have a 3rd column further defining the data value (which we call the ‘data definition’). Each field in the file should be separated by at least two spaces. And the file must end with ‘///’ on the last line (don’t ask us why).
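To make the column structure concrete, here is a minimal Python sketch that splits a module file into (data name, data value, data definition) triples. It assumes what is described above: fields are separated by two or more spaces, continuation lines start with whitespace and inherit the previous data name, and the file ends with '///'. The function name `parse_module_text` is hypothetical, and this is not the parser that anvi'o uses internally.

```python
import re


def parse_module_text(text):
    """Parse KEGG-style module text into (name, value, definition) tuples.

    Hypothetical helper illustrating the format; 'definition' is None
    when a line has no third column.
    """
    entries = []
    current_name = None
    terminated = False

    for raw_line in text.splitlines():
        if not raw_line.strip():
            continue  # ignore blank lines
        if raw_line.strip() == "///":
            terminated = True
            break

        # Fields are separated by at least two spaces
        fields = re.split(r"\s{2,}", raw_line.strip())

        if raw_line[0].isspace():
            # Continuation line: it belongs to the previous data name
            if current_name is None:
                raise ValueError("continuation line before any data name")
        else:
            # The first field is the all-capital data name
            current_name = fields.pop(0)

        value = fields[0] if fields else None
        definition = fields[1] if len(fields) > 1 else None
        entries.append((current_name, value, definition))

    if not terminated:
        raise ValueError("module file must end with '///'")
    return entries
```

Each ORTHOLOGY or ANNOTATION_SOURCE line yields its own triple, so a module with seven enzymes produces seven ORTHOLOGY entries.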
The data names you see in the example above are the minimum you should include to define the module. Here is a bit more information about each type of data:
--hmm-sourcedirectory name, and so on.
You can also define other data names, if you want. Some common ones that can be found in KEGG modules are COMPOUND, REACTION, PATHWAY, COMMENT, REFERENCE, and AUTHORS; but you are not limited by the ones used by KEGG.
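A quick way to confirm that a module file carries the minimum set of data names is sketched below in Python. The required names are taken from the example module on this page; the helper `missing_data_names` is hypothetical and only looks at which all-capital words start a line, so it says nothing about whether the values themselves are valid.

```python
import re

# Data names that appear in the example module, described as the minimum set
REQUIRED_DATA_NAMES = (
    "ENTRY",
    "NAME",
    "DEFINITION",
    "ORTHOLOGY",
    "CLASS",
    "ANNOTATION_SOURCE",
)


def missing_data_names(module_text):
    """Return the required data names that never start a line in module_text."""
    found = set()
    for line in module_text.splitlines():
        # Data names are all-capital words at the very start of a line;
        # continuation lines start with whitespace and do not match
        match = re.match(r"[A-Z_]+", line)
        if match:
            found.add(match.group(0))
    return [name for name in REQUIRED_DATA_NAMES if name not in found]
```

Extra data names such as COMPOUND or COMMENT are simply ignored by this check, in line with the fact that you are free to add your own.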
Why must we format the module files this way, you ask? Well, to be honest, KEGG modules are formatted like this, and our infrastructure for working with that data has simply been adapted to work with arbitrary, user-defined data. KEGG makes the rules :)
If you haven’t yet run anvi-setup-kegg-kofams on your computer, you will get an error when you try to run this program. This is because KEGG data is always used in addition to user-defined modules, and we need to be aware of which KEGG modules exist so we can make sure none of the user-defined modules have the same identifiers as these.
By default, this program looks for the KEGG data in the default location, so if you have set up KEGG data in a non-default directory, you should specify the path to that directory using the --kegg-data-dir parameter:
anvi-setup-user-modules --user-modules /path/to/user/data/directory --kegg-data-dir /path/to/KEGG/data/directory
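If you call this program from a script, a small Python sketch like the following can assemble the command with or without the KEGG directory. It uses only the two parameters shown on this page; the helper name `build_setup_command` is hypothetical.

```python
import shlex


def build_setup_command(user_modules_dir, kegg_data_dir=None):
    """Build the anvi-setup-user-modules command as an argument list."""
    command = ["anvi-setup-user-modules", "--user-modules", str(user_modules_dir)]
    if kegg_data_dir is not None:
        # Only needed when KEGG data lives in a non-default directory
        command += ["--kegg-data-dir", str(kegg_data_dir)]
    return command


# Show the command as it would be typed in a terminal
print(shlex.join(build_setup_command(
    "/path/to/user/data/directory",
    "/path/to/KEGG/data/directory",
)))
```

The returned list can be passed to `subprocess.run(command, check=True)` in an environment where anvi'o is installed.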
If you have multiple KEGG data directories on your computer, you should specify the one that you intend to use (along with this user-defined data) for anvi-estimate-metabolism downstream. It will not make a difference if all of your modules have identifiers unique from KEGG ones, but just in case they overlap, it is better to catch this during setup rather than later during metabolism estimation. :)
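The overlap check described above boils down to a set intersection, which can be sketched in Python as follows. How the KEGG module identifiers are obtained is left to the caller here (this sketch does not read any anvi'o database), and the function name `find_identifier_clashes` is hypothetical.

```python
from pathlib import Path


def find_identifier_clashes(user_data_dir, kegg_module_ids):
    """Return user module identifiers that are also KEGG module identifiers.

    kegg_module_ids is any iterable of identifiers supplied by the caller;
    this sketch makes no assumption about where they come from.
    """
    modules_dir = Path(user_data_dir) / "modules"
    # File names inside the 'modules' folder are the user module identifiers
    user_ids = {path.name for path in modules_dir.iterdir() if path.is_file()}
    return sorted(user_ids & set(kegg_module_ids))
```

An empty list means none of your module identifiers collide with the KEGG ones you supplied.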