Download and set up various databases from KEGG.
This program seems to know what it’s doing. It needs no input material from its user. Good program.
anvi-setup-kegg-data downloads and organizes data from KEGG for use by other programs, namely anvi-run-kegg-kofams, anvi-estimate-metabolism and anvi-reaction-network. Depending on which download mode you choose, it can download and set up one or more KEGG datasets, including the KOfam profiles, the MODULES and BRITE data, and the Orthology data.
Typically, some processing is done following the data download to make the data work with downstream anvi’o programs. The KOfam profiles are prepared for later use by the HMMER software, and the information from MODULES and BRITE is made accessible to other anvi’o programs as a modules-db. The Orthology data is converted into a nice table that can be utilized by anvi-reaction-network. This program generates a directory with these files (kegg-data).
You need to pick a mode to work with this program to control which data will be downloaded from KEGG. You can see the available modes by running the following command:
anvi-setup-kegg-data --list-modes
You use the --mode parameter to tell the program which mode you want, for example:
anvi-setup-kegg-data --mode modules
If you do not provide any arguments to this program, all KEGG data (i.e., --mode all) will be set up in the default KEGG data directory.
anvi-setup-kegg-data
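If you call this program from a script, the mode selection described above could be wrapped as in the following minimal sketch. The helper name build_setup_command is hypothetical (it is not part of anvi’o); it only assembles the command line from the --mode parameter and prints it, leaving the actual run to you.

```python
import shlex


def build_setup_command(mode=None):
    """Assemble an anvi-setup-kegg-data command line.

    Hypothetical helper for illustration; it is not part of anvi'o.
    """
    command = ["anvi-setup-kegg-data"]
    if mode is not None:
        # Without --mode, the program sets up all KEGG data by default
        command += ["--mode", mode]
    return command


if __name__ == "__main__":
    # Print the two variants discussed above so they can be copied into a shell
    print(shlex.join(build_setup_command()))
    print(shlex.join(build_setup_command("modules")))
```

The list form can be passed directly to subprocess.run if you prefer to launch the program from Python.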
By default, this program downloads a snapshot of the KEGG databases, already converted into an anvi’o-compatible format. The snapshot is a .tar.gz archive of a KEGG data directory that was (usually) generated around the time of the latest anvi’o release.
After the default KEGG archive is downloaded, it is unpacked, checked to confirm that all the expected files are present, and moved into the KEGG data directory.
Doing it this way ensures that almost everyone uses the same version of KEGG data, which is good for reproducibility and makes it easy to share annotated datasets. The KEGG resources are updated fairly often, and we found that constantly keeping the KEGG data directory in sync with them was not ideal, because every time the data directory is updated, you have to update the KOfam annotations in all your contigs databases to keep them compatible with the current modules-db (unless you were smart enough to keep the old version of the KEGG data directory around somewhere). And of course that introduces a new nightmare as soon as you want to share datasets with your collaborators who do not have the same KEGG data directory version as you. With everyone using the same kegg-data by default, we can avoid these issues.
But the trade-off to this is that the default KEGG data version is tied to an anvi’o release, and it will not always include the most up-to-date information from KEGG. Luckily, for those who want the most updated version of KEGG, you can still use this program to generate the KEGG data directory by downloading directly from KEGG (see ‘Getting the most up-to-date KEGG data’ section below).
BRITE hierarchy data is not included in the default KEGG snapshot for anvi’o v7. Starting from the v7.1-dev version of anvi’o, there is a new default KEGG snapshot including BRITE information. If you are missing this data, it can be acquired by either installing a later snapshot or by independently downloading it with this program using --mode modules.
The data for metabolic modeling are not included in the KEGG snapshots created before anvi’o v8. If you are missing this data, it can be acquired by either installing a later snapshot or by independently downloading it with this program using --mode modeling.
You can specify a different directory in which to put this data, if you wish:
anvi-setup-kegg-data --kegg-data-dir /path/to/directory/KEGG
This is helpful if you don’t have write access to the default directory location, or if you want to keep several different versions of the KEGG data on your computer. Just remember that when you want to use this specific KEGG data directory with later programs such as anvi-run-kegg-kofams, you will have to specify its location with the --kegg-data-dir flag.
By default, the KEGG snapshot that will be installed is the latest one, which is up-to-date with your current version of anvi’o. If, however, you want a snapshot from an earlier version, you can run something like the following to get it:
anvi-setup-kegg-data --kegg-data-dir /path/to/directory/KEGG \
                     --kegg-snapshot v2020-04-27
Just keep in mind that you may need to migrate the MODULES.db from these earlier versions in order to make it compatible with the current metabolism code. Anvi’o will tell you if you need to do this.
Not sure what KEGG snapshots are available for you to request? Well, you could check out the YAML file at anvio/anvio/data/misc/KEGG-SNAPSHOTS.yaml in your anvi’o directory, or you could just give something random to the --kegg-snapshot parameter and watch anvi’o freak out and tell you what is available:
anvi-setup-kegg-data --kegg-snapshot hahaha
This program is also capable of downloading data directly from KEGG and converting it into an anvi’o-compatible format. In fact, this is how we generate the default KEGG archive. If you want the latest KEGG data instead of the default snapshot of KEGG, try the following:
anvi-setup-kegg-data --download-from-kegg
Please note that this will download all the KEGG data (i.e., --mode all is the default). If you want to independently download individual KEGG datasets, you should pick one of the other modes (the --download-from-kegg flag is implicitly turned on in these modes).
KOfam profiles are downloadable from KEGG’s FTP site and all other KEGG data is accessible as flat text files through their API. When you run this program it will first get all the files that it needs from these sources, and then it will process them by doing the following:
- [...] orphan_data folder in your KEGG data directory)
- [...] hmmpress on them

An important thing to note about this option is that it has rigid expectations for the format of the KEGG data that it works with. Future updates to KEGG may break things such that the data can no longer be directly obtained from KEGG or properly processed. In the sad event that this happens, you will have to download KEGG from one of our archives instead.
The --only-download flag works for KOfam mode and modules mode.
Suppose you only want to download data from KEGG without processing it. For instance, perhaps you don’t need a modules-db or you don’t want hmmpress to be run on the KOfam profiles. You can instruct this program to stop after downloading by providing the --only-download flag:
anvi-setup-kegg-data --mode modules \
                     --only-download \
                     --kegg-data-dir /path/to/directory/KEGG
It’s probably a good idea in this case to specify where you want this data to go using --kegg-data-dir, to make sure you can find it later.
This option is primarily useful for developers to test anvi-setup-kegg-data - for instance, so that you can download the data once and run the database setup option (--only-processing) multiple times. However, if non-developers find another practical use-case for this flag, we’d be happy to add those ideas here. Send us a message, or feel free to edit this file and pull request your changes on the anvi’o GitHub repository. :)
The --only-processing flag works for KOfam mode and modules mode.
Let’s say you already have KEGG data on your computer that you got by running this program with the --only-download flag. Now you want to process the HMM files, or turn the MODULES data into a modules-db. To do that, run this program using the --only-processing flag and provide the location of the pre-downloaded KEGG data:
anvi-setup-kegg-data --mode modules \
                     --only-processing \
                     --kegg-data-dir /path/to/directory/KEGG
The KEGG data that you already have on your computer has to be in the format expected by this program, or you’ll run into errors. Pretty much the only reasonable way to get the data into the proper format is to run this program with the --only-download option. Otherwise you would have to go through a lot of manual file-changing shenanigans - possible, but not advisable.
One more note: since this flag is most often used for testing the database setup capabilities of this program, which entails running anvi-setup-kegg-data --mode modules --only-processing multiple times on the same KEGG data directory, there is an additional flag that may be useful in this context. To avoid having to manually delete the created modules database each time you run, you can use the --overwrite-output-destinations flag:
anvi-setup-kegg-data --mode modules \
                     --only-processing \
                     --kegg-data-dir /path/to/directory/KEGG \
                     --overwrite-output-destinations
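For repeated testing, the download-once, process-many-times workflow described above can be scripted. The following is a minimal sketch under the assumption that you want both steps to share one KEGG data directory; the helper name two_step_setup_commands is hypothetical (not part of anvi’o), and it only uses flags described in this document.

```python
import shlex


def two_step_setup_commands(kegg_data_dir, mode="modules", overwrite=False):
    """Return the download-only and processing-only commands for one data directory.

    Hypothetical helper for illustration; it is not part of anvi'o.
    """
    base = ["anvi-setup-kegg-data", "--mode", mode]
    download = base + ["--only-download", "--kegg-data-dir", kegg_data_dir]
    process = base + ["--only-processing", "--kegg-data-dir", kegg_data_dir]
    if overwrite:
        # Lets repeated processing runs replace the previously created output
        process.append("--overwrite-output-destinations")
    return download, process


if __name__ == "__main__":
    # Print both commands; run the first one once and the second as often as needed
    for command in two_step_setup_commands("/path/to/directory/KEGG", overwrite=True):
        print(shlex.join(command))
```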
As of anvi’o v7.1-dev, KEGG BRITE hierarchies are added to the modules-db when running this program with --mode modules. If you don’t want this cool new feature - because you are a rebel, or averse to change, or something is not working on your computer, whatever - then fine. You can use the --skip-brite-hierarchies flag:
anvi-setup-kegg-data --mode modules --skip-brite-hierarchies
Hopefully it makes sense to you that this flag does not work when setting up from a KEGG snapshot that already includes BRITE data in it.
Suppose you have been living on the edge and annotating your contigs databases with a non-default version of kegg-data, and you share these databases with a collaborator who wants to run downstream programs like anvi-estimate-metabolism on them. Your collaborator (who has a different version of kegg-data on their computer) will likely get version errors as detailed on the anvi-estimate-metabolism help page.
In order for your collaborator to be able to work with your dataset, they need to have the same kegg-data version as you did when you ran anvi-run-kegg-kofams. If you are very lucky and KEGG has not been updated since you set up your kegg-data, they may be able to run anvi-setup-kegg-data -D to get it. But if not, there are a few options for you to share your version of kegg-data:
- Run tar -czvf kegg_archive.tar.gz ./KEGG on the data directory to compress and archive it before sending it over (this command must be run from its parent directory so that the archive has the expected directory structure when it is unpacked). Then your collaborator can just run anvi-setup-kegg-data --kegg-archive kegg_archive.tar.gz --kegg-data-dir ./KEGG_ARCHIVE and be good to go. They would just have to use --kegg-data-dir ./KEGG_ARCHIVE when running downstream programs. The problem here is that even the archived kegg-data is quite large, ~4-5GB, and may be unfeasible for you to send.
- [...] --kegg-data-dir parameter.
- [...] Kofam.hmm file, and hmmpress must be run on that file to generate the required indices for hmmsearch. Your collaborator must also have the ko_list.txt file (which should be downloaded with the profiles) in the right spot. Then they could pass their makeshift KEGG data directory to anvi-run-kegg-kofams using --kegg-data-dir, and they should be golden. (A word of warning: they may want to remove KOs without bitscore thresholds in the ko_list.txt before concatenating the profiles, otherwise they will likely get a lot of weak hits for these KOs.)

If you have an archive (.tar.gz) of the KEGG data directory already on your computer (perhaps a colleague or Meren Lab developer gave you one), you can set up KEGG from this archive instead:
anvi-setup-kegg-data --kegg-archive KEGG_archive.tar.gz
This works the same way as the default, except that it bypasses the download step and instead uses the archive file you have provided with --kegg-archive.
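If you are the one preparing an archive for someone else to pass to --kegg-archive, the sketch below shows one way to build it with Python’s standard tarfile module instead of the tar command. The helper names are hypothetical (not part of anvi’o). The only assumption taken from this document is that the archive should unpack to a single top-level data folder, which is why the folder is added under its own name rather than its full path.

```python
import tarfile
from pathlib import Path


def archive_kegg_directory(kegg_dir, archive_path):
    """Write a .tar.gz archive whose single top-level folder is named after kegg_dir.

    Hypothetical helper for illustration; it is not part of anvi'o.
    """
    kegg_dir = Path(kegg_dir)
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # arcname keeps only the folder name, as if tar were run from the parent directory
        archive.add(str(kegg_dir), arcname=kegg_dir.name)
    return Path(archive_path)


def top_level_entries(archive_path):
    """Return the set of top-level names inside the archive, as a quick sanity check."""
    with tarfile.open(str(archive_path), "r:gz") as archive:
        return {Path(name).parts[0] for name in archive.getnames()}
```

Checking top_level_entries before sending the file is a cheap way to catch an archive that was created from the wrong directory. Write the archive outside of the folder being archived.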
Periodically (especially before releasing a new version of anvi’o), we want to add new KEGG database snapshots to anvi’o so that users can have more up-to-date KEGG data without having to use the --download-from-kegg option. In this section you will find the instructions for doing this (these instructions are also in the comments of the anvio/data/misc/KEGG-SNAPSHOTS.yaml file).
Available KEGG snapshots are stored in the anvi’o code repository in anvio/data/misc/KEGG-SNAPSHOTS.yaml. To add a new snapshot, you first need to create one by downloading and processing the data from KEGG, testing to make sure it works, and then updating this file. Here are the steps:
1. Run anvi-setup-kegg-data -D --kegg-data-dir ./KEGG -T 5. This will create the new KEGG data folder with its modules-db in your current working directory. Make sure you use the exact folder name of ./KEGG, because that is what anvi’o expects to find when it unpacks a KEGG snapshot. You may want to reduce or increase the number of threads (-T) according to your available compute resources.
2. Get the hash value of the modules database by running anvi-db-info ./KEGG/MODULES.db.
3. Archive the KEGG data folder by running tar -czvf KEGG_build_YYYY-MM-DD_HASH.tar.gz ./KEGG. Please remember to replace YYYY-MM-DD with the current date and replace HASH with the MODULES.db hash value obtained in step 2. This convention makes it easier to distinguish between KEGG snapshots by simply looking at the file name.
4. Test that setup works with the new archive by running anvi-setup-kegg-data --kegg-archive KEGG_build_YYYY-MM-DD_HASH.tar.gz --kegg-data-dir TEST_NEW_KEGG_ARCHIVE.
5. Upload the .tar.gz archive to Figshare. If you need inspiration for filling out the keywords, categories, and description fields for the archive, you can check the previous KEGG snapshots that have been uploaded. At minimum, we typically indicate the database version and hash value, and an example setup command (i.e., the one from step 4), in the description of the dataset. Once the archive is published on Figshare (warning: this usually takes a while due to the large file size), you can get the download URL of the archive by right-clicking on the Download button and copying the address, which should be a URL with a format similar to this example (but different numbers): https://figshare.com/ndownloader/files/34817812
6. Update the anvio/data/misc/KEGG-SNAPSHOTS.yaml file with the Figshare download URL, archive name, and MODULES.db hash and version. If you want this to become the default snapshot (which usually only changes before the next anvi’o release), you should also update the default self.target_snapshot variable in anvio/kegg.py to be this latest version that you have added.
7. Finally, run anvi-setup-kegg-data --kegg-data-dir TEST_NEW_KEGG, and if it works you are done, and can push your changes to the anvi’o repository. :)

If you want to get some data from the KEGG website that is not included in our default download (or, if you only want a subset of that data without going through the whole setup process), you can use the anvi’o API to utilize our download functions. Here are some examples for using the KeggSetup class (for example, in the Python interpreter):
### KeggSetup class
KeggSetup is the class for downloading KEGG data (using KEGG’s API). To use it in Python, you need to load the kegg module from anvi’o. When using it this way, we recommend skipping a variety of sanity checks using the skip_init parameter - this is mainly so that the class doesn’t check for, remove, or complain about existing KEGG data on your computer.
import anvio
import argparse
from anvio import kegg
args = argparse.Namespace(reset=False)
setup = kegg.KeggSetup(args, skip_init=True)
Once you have this class loaded, you can use its functions for a variety of download and processing tasks. We’ll show some examples below.
The following example demonstrates the download of all KEGG COMPOUND files belonging to the BRITE hierarchy with accession br08001. Note that if you do not specify a download directory, the files will by default be downloaded to the current working directory.
setup.download_kegg_files_from_hierarchy('br08001', download_dir='KEGG_COMPOUND')
### Downloading a hierarchical text file
If you just want to get a KEGG htext file (with extension .keg), use the following function:
setup.download_generic_htext('br08001', download_dir='KEGG_COMPOUND')
### Processing a hierarchical text file
We have a few functions for reading KEGG’s htext files. If all you want is a list of the accessions involved in this hierarchy (for instance, all compounds in a BRITE hierarchy for KEGG COMPOUND), use this one (the argument should be the path to the htext file):
accession_list = setup.get_accessions_from_htext_file("br08001.keg")
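To give a feel for what pulling accessions out of an htext file involves, here is a standalone sketch that does not use anvi’o at all. It assumes a layout in which each hierarchy line starts with a level letter followed by the entry, and that KEGG COMPOUND accessions look like C followed by five digits. The function name and the sample text are made up for illustration, and anvi’o’s own function may work differently.

```python
import re

# KEGG COMPOUND accessions have the form C00058 (assumed pattern for this sketch)
COMPOUND_ACCESSION = re.compile(r"^C\d{5}$")


def accessions_from_htext(text, accession_pattern=COMPOUND_ACCESSION):
    """Collect accessions from htext content, in order of first appearance.

    Illustrative only; this is not the anvi'o implementation.
    """
    accessions = []
    for line in text.splitlines():
        # Skip blank and header/footer lines; hierarchy lines start with a level letter
        if not line or not line[0].isalpha():
            continue
        fields = line[1:].split()
        if fields and accession_pattern.match(fields[0]) and fields[0] not in accessions:
            accessions.append(fields[0])
    return accessions


# Synthetic content, written by hand to mimic the assumed layout of a .keg file
SAMPLE = """+D\tCompound
!
A<b>Organic acids</b>
B  Carboxylic acids
C    Monocarboxylic acids
D      C00058  Formate; Methanoate
D      C00033  Acetate; Acetic acid
!
"""

print(accessions_from_htext(SAMPLE))
```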
If you want to process the KEGG module htext file to get a dictionary of all modules and their names/classes/etc, use the following code. You will need to set the kegg_module_file attribute (of the ModulesDownload class) to point to the location of the modules.keg file, and the function will store the module dictionary in the module_dict attribute.
modules_setup = kegg.ModulesDownload(args)
modules_setup.kegg_module_file = "modules.keg"
modules_setup.process_module_file()
modules_setup.module_dict # this attribute now stores the module dictionary
### Downloading a flat file using the KEGG API
Here is a wrapper function that will ‘get’ a flat file with the KEGG API. You can provide this function with the accession of the data you want (for instance, a module accession), and optionally a directory to download it into.
setup.download_generic_flat_file('C00058', download_dir='KEGG_COMPOUND')
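For context, a ‘get’ request against KEGG’s public REST API is a plain HTTP download. The sketch below shows that request without anvi’o, using only the Python standard library. The function names and the choice to save the file under its accession are assumptions made for illustration; they are not taken from the anvi’o code.

```python
import urllib.request
from pathlib import Path

# KEGG's public REST endpoint for the 'get' operation
KEGG_REST_GET = "https://rest.kegg.jp/get/"


def kegg_get_url(accession):
    """Build the KEGG REST URL that returns the flat file for one accession."""
    return KEGG_REST_GET + accession


def download_flat_file(accession, download_dir="."):
    """Fetch one flat file and save it as <download_dir>/<accession>.

    Illustrative only; the output file name is an assumption of this sketch.
    """
    target_dir = Path(download_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / accession
    # Requires network access; raises urllib.error.URLError on failure
    with urllib.request.urlopen(kegg_get_url(accession)) as response:
        target.write_bytes(response.read())
    return target
```

Calling download_flat_file('C00058', download_dir='KEGG_COMPOUND') would request the same compound entry as the anvi’o example above.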