Predicts metabolic exchanges between microbes based on the reaction network.
đ To the main page of anviâo programs and artifacts.
contigs-db reaction-network
equivalent-compounds-txt
external-genomes
genome-pairs
reaction-ref-data
kegg-data
metabolite-exchange-predictions
anvi-predict-metabolic-exchanges uses the metabolic capacity encoded in two genomes to predict which metabolites could be exchanged between two organisms, and which metabolites are unique to one of the organisms. The output files produced by the program are described in metabolite-exchange-predictions.
By leveraging the reaction-network infrastructure of anviâo, the program examines the overlap between the metabolic reactions that each of the two organisms can catalyze and identifies which metabolic compounds fall into one of the following categories:
It reports the metabolites in categories 1 and 2 as âpotentially-exchanged compoundsâ, and the metabolites in category 3 as âuniqueâ compounds. Here is a table of what it looks for (everything marked with a star is reported by the program, and everything with a gray background is ignored):
Briefly, the program works by checking for these conditions for every metabolite in the genomesâ (merged) reaction networks. It does so in two orthogonal ways:
Either of these strategies can be skipped, in case you prefer to use only one method. If you want to know more, check the Technical Details section.
This program uses the reaction-network stored in each contigs-db to search for potentially-exchanged or unique compounds. Since the reaction-network and this program both rely on the ModelSEED and KEGG databases (especially KEGG Pathway Maps and KOfam annotation profiles for KEGG Orthlogs), you need to have access to that data on your computer. You also need to have KEGG Ortholog (KO) annotations (kegg-functions) in your input genomes (which translates to having the functional annotation source âKOfamâ in the contigs-dbs).
Here are the steps you need to run before this program:
If youâve done all that, you are good to go.
Provide both contigs databases for your genomes and a prefix for the output files:
anvi-predict-metabolic-exchanges -c1 contigs-db -c2 contigs-db \ -O ANY_PREFIX
The metabolite-exchange-predictions output file names will start using the prefix you provided.
There are two ways for running this program in âmulti-modeâ so that it predicts exchanges between multiple pairs of genomes. Both of them involve using an external-genomes file to point the program towards each contigs database. You will get one set of output files containing the predictions from all pairs.
If possible for your computer setup, we recommend increasing the number of threads using the --num-threads
or -T
parameter so that multiple pairs of genomes can be processed in parallel.
All-vs-all comparisons
If you provide only an external-genomes file, the program will determine all possible pairs of genomes and predict exchanges between all of them.
anvi-predict-metabolic-exchanges -e external-genomes \ -O ANY_PREFIX \ --num-threads 1
Specific pairwise comparisons
If you only want to predict exchanges between specific pairs of genomes, then you can enumerate those pairs in a genome-pairs file and provide that to the program (in addition to the external-genomes file).
anvi-predict-metabolic-exchanges -e external-genomes \ --genome-pairs-txt genome-pairs \ -O ANY_PREFIX \ --num-threads 1
Additional steps to run this program on internal genomes (bins)
If you have a collection of bin in a metagenome assembly, commonly described in an internal-genomes file, then you have to first run the program anvi-split to extract those bins into their own individual contigs databases. You will then need to run anvi-reaction-network on each binâs individual contigs database, because running anvi-split does not split any reaction network made for the entire metagenome into subnetworks for each bin (at least, it did not do so at the time of writing this documentation).
Once you have a reaction-network in each binâs contigs-db, you can create an external-genomes to use as input for anvi-predict-metabolic-exchanges as described above.
The ModelSEED database sometimes has multiple compound ID numbers for what (we think) should be the same metabolite, at least for the purposes of predicting exchanges. A prime example of this is amino acids, which can have compounds where their chirality is specified (as in L-Lysine (cpd00039)
) and where the chirality is generic (as in Lysine (cpd19182)
). This is not so much a problem when we are using KEGG Pathway Maps to predict exchanges, but can lead to missing predictions when using the reaction network.
For amino acids, the program can automatically detect which compound IDs should be considered equivalent, and take that into account when looking for potentially-exchanged compounds. If you use the --use-equivalent-amino-acids
flag, the program will search through the ModelSEED database for any conventional amino acids (plus Selenocysteine and Pyrrolysine) that have both an âL-â version and a chiral-unspecific version, and set those two compounds equivalent to each other.
anvi-predict-metabolic-exchanges -c1 contigs-db -c2 contigs-db \ -O ANY_PREFIX \ --use-equivalent-amino-acids
If you do this, you will get an output file listing the amino acid compounds that were deemed equivalent, so you can make sure you agree with them.
If you want to specify a custom set of equivalent compound IDs, you can instead provide an equivalent-compounds-txt file to the --custom-equivalent-compounds-file
parameter:
anvi-predict-metabolic-exchanges -c1 contigs-db -c2 contigs-db \ -O ANY_PREFIX \ --custom-equivalent-compounds-file equivalent-compounds-txt
If you want to skip the first prediction step of walking over KEGG Pathway Maps to find potential exchanges, use the --no-pathway-walk
flag:
anvi-predict-metabolic-exchanges -c1 contigs-db -c2 contigs-db \ -O ANY_PREFIX \ --no-pathway-walk
If you want to skip the second prediction step of examining the local reaction network around each compound, use the --pathway-walk-only
flag:
anvi-predict-metabolic-exchanges -c1 contigs-db -c2 contigs-db \ -O ANY_PREFIX \ --pathway-walk-only
It is hopefully understandable that these two flags are incompatible with each other.
By default, this program will process all regular-type KEGG Pathway Maps (those with map IDs starting with prefix â00â) which have Reaction (âRNâ) type KGML files available in the KEGG data directory. If there are some Pathway Maps that you donât want used for prediction (because they are irrelevant to your system, or you think they lead to misleading results, or they take too long to process, whatever), you can use the --exclude-pathway-maps
option to provide a comma-separated list of Pathway Map ID numbers (just the numbers, no string prefix of any kind).
In the example below, we exclude two Pathway Maps: map 00195 (Photosynthesis) and map 00190 (Oxidative Phosphorylation). Both of these describe several enzyme complexes and do not explicitly contain any chemical reactions; hence, they do not have âRNâ-type KGML files and would be skipped by the program anyway. The only effect of using the flag in this case would be that you donât see any warnings on the terminal about those two Pathway Maps. However, if you use the flag to exclude a map that does have an âRNâ-type KGML file, such as map 00470 (D-amino acid biosynthesis), the effect would be to remove any predictions (and accompanying evidence) that come from that Pathway Map.
anvi-predict-metabolic-exchanges -c1 contigs-db -c2 contigs-db \ -O ANY_PREFIX \ --exclude-pathway-maps 00195,00190
This option is only relevant if you allow the Pathway Map Walk prediction strategy (i.e., you didnât use --no-pathway-walk
).
The --maximum-gaps
parameter applies to the first prediction step of walking over KEGG Pathway Maps, and allows a certain number of missing enzyme annotations in the reaction chains. By default, we donât allow any gaps, but if you think missing annotations in either genome might be throwing off your predictions, you can set this parameter to an integer greater than 0:
anvi-predict-metabolic-exchanges -c1 contigs-db -c2 contigs-db \ -O ANY_PREFIX \ --maximum-gaps 1
Changing this parameter will mostly affect the evidence attributed to a given prediction (i.e., length and overlap of reaction chains), but there could be some cases where increasing the gap number enables new predictions to be made.
If you set up your KEGG or ModelSEED data in a custom directory, you can make sure this program knows where to find it by providing the paths:
anvi-predict-metabolic-exchanges -c1 contigs-db -c2 contigs-db \ -O ANY_PREFIX \ --kegg-data-dir /path/to/directory/KEGG \ --modelseed-data-dir /path/to/directory/MODELSEED
The data directories are relevant for loading the reaction-network in the contigs database, so it is best to use the same data directories that were utilized when running anvi-reaction-network.
If you have a lot of computational resources, you can make this program work faster by parallelizing some of the tasks. For instance:
--num-threads
--num-threads
parameter. Each comparison, however, will internally utilize one thread, UNLESSâŠ--num-parallel-processes
parameter and the --num-threads
parameterConfused? Letâs see a few examples.
Multithreading âsingleâ mode
In this example, the program will predict exchanges between one pair of genomes, and it will be able to do Pathway Map walks for 4 Pathway Maps at once.
anvi-predict-metabolic-exchanges -c1 contigs-db -c2 contigs-db \ -O ANY_PREFIX \ --num-threads 4
Resource requirement: Your computer (or the job you submitted to a high-performance computing cluster) must have >4 threads available to run this example.
Why >4 and not >=4? Iâm glad you asked. You see, the program itself has to run on one thread and it will start 4 child processes to work on the Pathway Map walks, bringing the total usage up to n=5 threads. Why not spawn n-1
child processes so that the total thread usage of the program is exactly equal to the --num-threads
number? Well, that is a good idea. But in practice, the program (parent process) is not doing very much while waiting for its child processes to send their data back. And it seemed less confusing this way to begin with. If you think that extra thread will cause issues, now you know and you can set --num-threads
a bit lower accordingly. And if you are upset about this, you know who to complain to. đ
Multithreading multi-mode (multiple single-threaded comparisons at once)
In this example, the program will predict exchanges between all possible pairs of genomes in the provided external-genomes file, and it will be able to process 4 genome pairs at once. Each genome pair comparison is single-threaded in this case, meaning that there is no parallel processing of Pathway Map walks.
anvi-predict-metabolic-exchanges -e external-genomes \ -O ANY_PREFIX \ --num-threads 4
Resource requirement: Your computer (or the job you submitted to a high-performance computing cluster) must have >4 threads available to run this example.
Multiprocessing multi-mode (multiple multithreaded comparisons at once)
Finally, in this example, the program will process multiple pairs of genomes, but each genome pair comparison is multithreaded. Specifically, here we allow it to process 3 genome pairs in parallel, and each parallel process can in turn use 4 threads (so the Pathway Map Walk is multithreaded).
anvi-predict-metabolic-exchanges -e external-genomes \ -O ANY_PREFIX \ --num-threads 4 \ --num-parallel-processes 3
Resource requirement: Your computer (or the job you submitted to a high-performance computing cluster) must have >12 threads available to run this example.
Why >12 threads? Well, 4 * 3 = 12, plus there is one additional thread on which the main program is running while waiting for its child processes for a total usage of n=13 threads (confused? read the yellow box under the first example). If you are paying really close attention, you might realize that each child process needs to run on its own thread while spawning its own set of (grand)child processes. Good catch! We account for this by allowing each child process to create t-1
threads (where t
is the --num-threads
value) so that the total per-process footprint is equal to --num-threads
. So in this example, technically only 3 Pathway Map walks are done at once for each genome pair comparison even though we set --num-threads
to 4. At large scales, we cannot necessarily afford to ignore an extra thread per child process like we do for the parent process.
This section describes how the predictions of metabolic exchanges are done, for a given pair of genomes.
You might recall the table of metabolite production vs consumption possibilities at the top of this documentation page. Here it is again so that you donât have to scroll up:
To identify potentially-exchanged compounds, we look for all the cases marked by a dark green star; that is, compounds for which one of the genomes can either only produce or only consume the metabolite, such that it is possible that this genome is passing the metabolite to the other genome or receiving it from the other genome, respectively.
To identify unique compounds, we look for all the cases marked by a light green star in the table; that is, compounds for which production and/or consumption can only be done by one of the two genomes. We ignore all the other possibilities (gray background in the table) because the genomes either have no biological motivation to share the metabolite (that we can detect) or the situation raises questions that we simply cannot answer computationally.
In some cases, more than one of the organisms can produce/consume a metabolite, but we have to pick just one of the two as the âprimaryâ producer or consumer of the compound. This is how we do that for a compound that is potentially-exchanged:
This role assignment is mostly relevant for interpreting the Pathway Walk evidence (described in a later section). The output of the program will describe all genomes that can âproduceâ or âconsumeâ a compound when relevant, so you wonât be missing any information about who can do what.
In order to start searching for the cases described above, we need to first identify the capacity for production or consumption of a metabolite by looking for the enzymes that can catalyze chemical reactions with the metabolite as a product or as a substrate, respectively. This is done using the reaction-networks generated by anvi-reaction-network. Each genomeâs reaction-network matches the KEGG KOfam annotations of the genome to chemical reactions in the ModelSEED database, and includes all the enzymes/reactions/metabolites that are theoretically usable by the organism.
anvi-predict-metabolic-exchanges therefore starts by merging the two genomeâs reaction networks together, to make one big network that tracks which reactions are catalyzable by each genome:
We can then iterate over every single metabolite within the merged network to identify whether it represents into one of our cases of interest from the table. Conceptually, this means looking at the production/consumption reactions surrounding the metabolite within the network and asking which genome(s) can produce/consume the compound. Here is an example in which the compound is considered âpotentially-exchangedâ because only one genome (the purple one) can produce the compound, while both genomes have consumption reactions for it (purple and green arrows):
Essentially, the program examines the potential transfer point of the metabolite, meaning the point during the chemical transformations of the compound at which it could be transferred from one genome to another. However, there are two different strategies for examining this transfer point and its surrounding network of reactions, which result in predictions with different levels of accuracy and confidence. These two approaches are described in the next two sections (in order of their complexity).
This approach implements almost exactly the âtransfer pointâ examination described in the previous section: for a given compound, we subset the merged reaction network to keep only the reactions immediately surrounding the compound (ie, production and/or consumption reactions). This allows us to designate the role of each genome using the following rules:
Then we can check if only one genome can participate in those reactions (a âuniqueâ compound), if the production or consumption is one-sided (a âpotentially-exchangedâ compound), or if we have one of those other situations from the table that we ignore (no prediction).
Advantages of this approach:
Caveats to this approach:
--custom-equivalent-compounds-file
to match all of the compound IDs that should be considered equivalent to each other. However, the scale of ModelSEED was too large for us to automatically identify matches for anything other than common amino acids (see the flag --use-equivalent-amino-acids
described above).Bottom line: you have to be extra careful with interpreting any predictions labeled with the Reaction_Network_Subset
prediction method in the output of this program.
Provided you run anvi-predict-metabolic-exchanges
without the --pathway-walk-only
flag, it will use the Reaction Network Subset approach. Normally, this strategy is the second stage of prediction. If you allow the Pathway Map Walk strategy to run, then that will run first to make a limited yet likely more accurate set of predictions for any compounds that participate in a KEGG Pathway Map. Afterwards, the Reaction Network Subset approach will process any reminaing compounds from the merged reaction network that do not participate in a KEGG Pathway Map.
If the Pathway Map Walk strategy is skipped (for instance, with the --no-pathway-walk
), then the Reaction Network Subset approach will process all compounds from the merged reaction network.
This prediction approach is more complex, utilizes additional information from the KEGG PATHWAY database, and adds evidence associated with each prediction that enables one to filter for more reasonable-looking results.
It starts by associating compounds from the merged reaction network with KEGG Pathway Maps, and for each of the resulting Pathway Maps, it finds all possible âwalksâ through a Map for a given compound within the context of each genomeâs reaction network. Briefly, a âwalkâ starts from a compound and chains together the set of subsequent (or previous) reactions that are present in the genomeâs network, until it finds a reaction that the genome cannot catalyze (due to the absence of required enzymes). A âwalkâ can be done in either the âproductionâ direction (all reactions leading up to the production of the compound in the current organism) or the âconsumptionâ direction (all reactions leading away from the consumption of the compound in the current organism), and results in one or more reaction chains that describe the possible ways this metabolite can be generated or utilized by the organism. When there are branch points in the Pathway Map, all these branches are fully explored (potentially resulting in multiple reaction chains for either âproductionâ or âconsumptionâ of the compound). By default, we donât allow any gaps in the chain (meaning that we stop exploring a branch once we reach a reaction that is not present in the reaction network), but you can change this using the parameter --maximum-gaps
, in which case the walk will continue past these gaps until it reaches the maximum allowed number of missing reactions. The resulting reaction chains are specific to a given compound, Pathway Map, genome, and direction (âproductionâ vs âconsumptionâ).
To summarize, this Pathway Walk strategy gives us âproductionâ reaction chains and âconsumptionâ reaction chains for a compound within each organism. For example, if the compound in question is urea, and we are walking over Pathway Map map00230 (Purine Metabolism), we might end up with the following set of reaction chains:
In this example, the purple arrows represent the reactions found in the predicted âproducerâ organism (who can generate urea from urate using a production chain of 5 reactions, and who cannot consume urea) while the green arrows represent the reactions found in the predicted âconsumerâ organism. The consumer does not have a production chain because there are a lot of gaps in its network (in fact, we would not find a production chain in this genome unless we set --maximum-gaps
to at least 3). However, it does have a consumption reaction chain of length 1 (the reaction converting urea to bicarbonate).
The Pathway Map walking functionality works directly with KGML files and was recently implemented by Dr. Sam Miller. It is a powerful subsystem of anviâo that will hopefully be further described elsewhere in our documentation pages eventually. In the meantime, for the technically curious, you can check out the code in the kgml.py
and kgmlnetworkops.py
modules in the anviâo Github repository.
Prediction of potential exchanges from the Pathway Map walk results is then fairly straightforward:
And from there, we can decide which situation the compound falls into based on our prediction table, just like we do in the Reaction Network Subset approach. But unlike the Reaction Network Subset approach, we do not stop at prediction.
The final step of this approach is using the reaction chains returned by the Pathway Map walk to compute evidence that supports or opposes each prediction. We calculate the length of the longest production chain in the âprimary producerâ of the compound, and compare this chain to any production chains in the âprimary consumerâ to compute how much overlap there is (if any). We do the same thing on the other side of the transfer point, by computing the length of the longest consumption chain in the âprimary consumerâ and evaluating the overlap from consumption chains in the âprimary producerâ. Overlap is described in both length form (number of overlapping reactions) and proportion form (relative to the length of the reaction chain). You can see examples of these values in the example reaction chain image above.
If a compound is found in multiple Pathway Maps, we aggregate the evidence from all of those maps and report the most informative Pathway Maps (one for production, one for consumption) in the main *-potentially-exchanged-compounds.txt
output file. âMost informativeâ here means the map with the longest reaction chain. And if there are multiple maps with the same maximum chain length, then we report the map with the smallest (real-number) proportion of overlap (âreal-numberâ just means we try to avoid reporting None overlap values when possible). Detailed evidence from all Pathway Maps considered for a given compound is reported in the *-evidence.txt
output file in case you need to take a closer look.
Advantages of this approach:
Caveats to this approach:
--debug
flag to see if any Pathway Maps caused issues).Bottom Line: you should probably focus on predictions from the prediction method Pathway_Map_Walk
whenever possible
Provided you run anvi-predict-metabolic-exchanges
without the --no-pathway-walk
flag, it will use the Pathway Map Walk approach first to make predictions for any compound associated with at least one KEGG Pathway Map.
In order for the Pathway Map walk to work, we need the map to (1) include Kegg Orthologs (KOs) so that we can link up the reaction network information with the Pathway Map information, and (2) have a Reaction (âRNâ) type KGML file available in the KEGG data directory. This means we only process regular KEGG Pathway Maps categorized under âMetabolismâ, whose map IDs start with the prefix â00â â you can see a list of all Maps on the KEGG PATHWAY webpage if you are curious what falls under this category. And out of those maps, we have to skip any map that doesnât have an âRNâ-type KGML file. The program detects and skips any non-âRNâ maps automatically and warns the user on the terminal about them.
If you want to exclude a subset of these KEGG Pathway Maps, you can do so with the --exclude-pathway-maps
as described previously on this page.
Edit this file to update this information.
Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__
tag in this file to see an example.