This is a driver program for anvi-script-enrichment-stats, a script that computes enrichment scores and group associations for annotated entities (i.e., functions, KEGG Modules) across groups of genomes or samples.
Artifacts this program can consume: kegg-metabolism, groups-txt, misc-data-layers, pan-db, genomes-storage-db, external-genomes, internal-genomes
This program has multiple abilities. It can compute enriched functions across categories in a pangenome, enriched metabolic modules across groups of samples, or enriched functions across groups of genomes. To do this, it relies on the script anvi-script-enrichment-stats by Amy Willis.
Regardless of the situation, it returns a matrix of things that are enriched within specific groups in your dataset, as a functional-enrichment-txt file.
In the pangenome case, this program will return a matrix of functions that are enriched within specific groups in your pangenome.
You provide a pan-db and genomes-storage-db pair, as well as a misc-data-layers that stores categorical data, and the program will treat each category as its own "pan-group". It will then find functions that are enriched in that group (i.e., functions that are associated with gene clusters that are characteristic of the genomes in that group). It returns this output as a functional-enrichment-txt.
Note that your genomes-storage-db must have at least one functional annotation source for this to work.
This helps you highlight functions or pathways that distinguish a specific pan-group and determine the functional core of your pangenome. For example, in the Prochlorococcus pangenome (the one used in the pangenomics tutorial, where you can find more info about this program), this program finds that Exonuclease VII is enriched in the low-light pan-group. The output file provides various statistics about how confident the program is in making this association.
What this program does can be broken down into three steps:
If you're still curious, check out Alon's behind the scenes post, which goes into a lot more detail.
Here is the simplest run of this program:
anvi-compute-functional-enrichment -p pan-db \
    -g genomes-storage-db \
    -o functional-enrichment-txt \
    --category-variable CATEGORY \
    --annotation-source FUNCTION_SOURCE
You must provide this program with a pan-db and its corresponding genomes-storage-db. You must also provide an output file name.
The pan-db must contain at least one categorical data layer in misc-data-layers, and you must choose one of these categories to define your pan-groups with the --category-variable parameter. Note that by default any genomes not in a category will be ignored; you can instead include these in the analysis by using the flag --include-ungrouped.
The genomes-storage-db must have at least one functional annotation source, and you must choose one of these sources with the --annotation-source parameter. If you do not know which functional annotation sources are available in your genomes-storage-db, you can use the --list-annotation-sources parameter to find out.
By default, gene clusters with the same functional annotation will be merged. But if you provide the --include-gc-identity-as-function parameter and set the annotation source to be "IDENTITY", anvi'o will treat gene cluster names as functions and enable you to investigate enrichment of each gene cluster independently. This is how you do it:
anvi-compute-functional-enrichment -p pan-db \
    -g genomes-storage-db \
    -o functional-enrichment-txt \
    --category-variable CATEGORY \
    --annotation-source IDENTITY \
    --include-gc-identity-as-function
To output a functional occurrence table, which describes the number of times each of your functional associations occurs in each genome you're looking at, use the --functional-occurrence-table-output parameter, like so:
anvi-compute-functional-enrichment -p pan-db \
    -g genomes-storage-db \
    -o functional-enrichment-txt \
    --category-variable CATEGORY \
    --annotation-source FUNCTION_SOURCE \
    --functional-occurrence-table-output FUNC_OCCURRENCE.TXT
You can interact more with this data file by using anvi-matrix-to-newick. Find more information about this output option here.
This option computes enrichment scores for metabolic modules in groups of samples. In order to do this, you must already have estimated completeness of metabolic modules in your samples using anvi-estimate-metabolism and obtained a "modules" mode output file (the default). You must provide that file to this program along with a groups-txt file indicating which samples belong to which groups.
Any module whose completeness in a sample passes the threshold (set with --module-completion-threshold) will be considered to be present in that sample. The enrichment test is then run with anvi-script-enrichment-stats, and it produces a functional-enrichment-txt file.
See kegg-metabolism for more information on the "modules" mode output format from anvi-estimate-metabolism, which you must provide with the -M flag. The sample names in this file must match those in the groups-txt file, provided with -G. You must also provide the name of the output file.
anvi-compute-functional-enrichment -M MODULES.TXT \
    -G groups-txt \
    -o functional-enrichment-txt
The default completeness threshold for a module to be considered "present" in a sample is 0.75 (75 percent). If you wish to change this, you can do so by providing a different threshold, as a number in the range (0, 1], using the --module-completion-threshold parameter. For example:
anvi-compute-functional-enrichment -M MODULES.TXT \
    -G groups-txt \
    -o functional-enrichment-txt \
    --module-completion-threshold 0.9
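To make the effect of the threshold concrete, here is a minimal Python sketch of how completeness scores could be turned into presence/absence calls. It is an illustration, not the program's actual code: the function name `modules_present`, the module identifiers, and the scores are hypothetical, and treating a score exactly equal to the threshold as "present" is an assumption.

```python
def modules_present(completeness, threshold=0.75):
    """Return the set of modules considered 'present' in one sample.

    `completeness` maps a module name to its completeness score in [0, 1].
    ASSUMPTION: a module counts as present when its score is at or above
    the threshold; the real program may treat the boundary differently.
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in the range (0, 1]")
    return {module for module, score in completeness.items() if score >= threshold}


# Hypothetical completeness scores for a single sample.
sample = {"M00001": 1.0, "M00002": 0.8, "M00003": 0.5}

default_calls = modules_present(sample)       # default threshold of 0.75
strict_calls = modules_present(sample, 0.9)   # stricter threshold of 0.9
```

With these made-up scores, raising the threshold from 0.75 to 0.9 drops the module scored at 0.8 from the "present" set, which in turn changes the presence/absence data that the enrichment test sees.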
By default, the column containing sample names in your MODULES.TXT file will have the header db_name, but there are certain cases in which you might have them in a different column, for example if you did not run anvi-estimate-metabolism in multi-mode. In those cases, you can specify that a different column contains the sample names by providing its header with --sample-header. For example, if your sample names were in the metagenome_name column, you would do the following:
anvi-compute-functional-enrichment -M MODULES.TXT \
    -G groups-txt \
    -o functional-enrichment-txt \
    --sample-header metagenome_name
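Before running the program, it can be handy to check which column of your modules file actually holds the sample names. The sketch below reads a tab-delimited table and collects the distinct values under a given header. Only the `db_name` and `metagenome_name` headers come from the text above; the helper name `sample_names`, the other column names, and the file contents are hypothetical.

```python
import csv
import io


def sample_names(modules_text, sample_header="db_name"):
    """Collect the distinct sample names found under `sample_header`.

    `modules_text` is the content of a tab-delimited file with a header row.
    """
    reader = csv.DictReader(io.StringIO(modules_text), delimiter="\t")
    if sample_header not in (reader.fieldnames or []):
        raise KeyError(f"no column named {sample_header!r} in the header row")
    return {row[sample_header] for row in reader}


# Hypothetical, heavily simplified file contents (real files have more columns).
multi_mode = "module\tdb_name\tcompleteness\nM00001\tS1\t1.0\nM00001\tS2\t0.5\n"
single_mode = "module\tmetagenome_name\tcompleteness\nM00001\tMG1\t0.9\n"

names_default = sample_names(multi_mode)                     # uses db_name
names_custom = sample_names(single_mode, "metagenome_name")  # custom header
```

The resulting set can then be compared against the sample names listed in your groups-txt file to spot mismatches early.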
If you ran anvi-estimate-metabolism on a bunch of extra samples but only want to include a subset of those samples in the groups-txt, that is fine: by default any samples from the MODULES.TXT file that are missing from the groups-txt will be ignored. However, there is also an option to include those missing samples in the analysis, as one big group called "UNGROUPED". To do this, you can use the --include-samples-missing-from-groups-txt parameter. Just be careful: if you are also using the --include-ungrouped flag (see below), any samples without a specified group in the groups-txt will also be included in the "UNGROUPED" group.
anvi-compute-functional-enrichment -M MODULES.TXT \
    -G groups-txt \
    -o functional-enrichment-txt \
    --include-samples-missing-from-groups-txt
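The interaction between the two flags is easier to see in a small example. The Python sketch below is one reading of the behavior described above, not the program's implementation: the function `assign_groups` and its arguments are hypothetical, and an empty string stands in for a sample that is listed in the groups-txt without a group.

```python
def assign_groups(samples, groups, include_missing=False, include_ungrouped=False):
    """Map each sample that would enter the analysis to its group name.

    samples: sample names found in the modules file
    groups:  dict of sample -> group as read from groups-txt; an empty
             value means the sample is listed but has no group
    include_missing:   mimics --include-samples-missing-from-groups-txt
    include_ungrouped: mimics --include-ungrouped
    """
    assigned = {}
    for sample in samples:
        if sample not in groups:
            # Sample is absent from groups-txt entirely.
            if include_missing:
                assigned[sample] = "UNGROUPED"
        elif not groups[sample]:
            # Sample is listed in groups-txt but has no group.
            if include_ungrouped:
                assigned[sample] = "UNGROUPED"
        else:
            assigned[sample] = groups[sample]
    return assigned


samples = ["S1", "S2", "S3", "S4"]            # S4 is missing from groups-txt
groups = {"S1": "A", "S2": "B", "S3": ""}     # S3 has no group

by_default = assign_groups(samples, groups)
with_both = assign_groups(samples, groups, include_missing=True, include_ungrouped=True)
```

With both options enabled, S3 and S4 land in the same "UNGROUPED" group even though they got there for different reasons, which is the situation the warning above is about.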
You are not limited to computing functional enrichment in pangenomes; you can do it for regular genomes, too. This option takes either external or internal genomes (or both) which are organized into groups, and computes enrichment scores and associated groups for annotated functions in those genomes.
This is similar to computing functional enrichment in pangenomes (as described above), but a bit simpler.
The program uses anvi-script-enrichment-stats to fit a GLM to determine (A) the degree to which a particular functional annotation is unique to a single group and (B) the percent of genomes it appears in within each group. This produces a functional-enrichment-txt file.
You can provide either an external-genomes file or an internal-genomes file or both, but no matter what, these files must contain a group column which indicates the group that each genome belongs to. Similar to option 1, you must also provide an annotation source from which to extract the functional annotations of interest. In the example below, we provide both types of input files.
anvi-compute-functional-enrichment -i internal-genomes \
    -e external-genomes \
    -o functional-enrichment-txt \
    --annotation-source FUNCTION_SOURCE
Also similar to option 1, you can get a tab-delimited matrix describing the occurrence (counts) of each function within each genome using the --functional-occurrence-table-output parameter:
anvi-compute-functional-enrichment -i internal-genomes \
    -e external-genomes \
    -o functional-enrichment-txt \
    --annotation-source FUNCTION_SOURCE \
    --functional-occurrence-table-output FUNC_OCCURRENCE.TXT
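As a rough picture of what such an occurrence matrix contains, the Python sketch below counts how often each function appears in each genome and prints the result as tab-delimited text. It is only an illustration: the helper `occurrence_table`, the genome names, the "DNA ligase" annotation, and the layout (functions as rows, genomes as columns) are assumptions and may not match the actual output file.

```python
from collections import Counter


def occurrence_table(annotations):
    """Build a tab-delimited count matrix from (genome, function) pairs.

    ASSUMPTION: rows are functions and columns are genomes; the real
    output file may be laid out differently.
    """
    counts = Counter(annotations)
    genomes = sorted({genome for genome, _ in counts})
    functions = sorted({function for _, function in counts})
    lines = ["\t".join(["function"] + genomes)]
    for function in functions:
        # Counter returns 0 for (genome, function) pairs that never occurred.
        cells = [str(counts[(genome, function)]) for genome in genomes]
        lines.append("\t".join([function] + cells))
    return "\n".join(lines)


# Hypothetical annotations: one (genome, function) pair per annotated gene.
pairs = [
    ("G1", "Exonuclease VII"),
    ("G1", "Exonuclease VII"),
    ("G2", "Exonuclease VII"),
    ("G2", "DNA ligase"),
]

table = occurrence_table(pairs)
print(table)
```

A matrix of this shape is the kind of input that downstream tools such as anvi-matrix-to-newick are designed to work with.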
If you provide the --include-ungrouped parameter, then genomes (or samples) without a group will be included in the analysis. (By default, these genomes/samples are ignored.) For the pangenome case, these genomes are those without a category in the provided --category-variable. For metabolic modules or the genomes in groups case, these samples/genomes are those with an empty value in the "group" column (of either the groups-txt or the external-genomes/internal-genomes files).
anvi-script-enrichment-stats
This program serves as the interface to anvi-script-enrichment-stats, an R script which performs an enrichment test on your input. You will find a brief description of how this script works in Alon's "Behind the Scenes" note in the pangenomics tutorial. Better yet, check out the methods section of Alon's paper, published in Genome Biology.