A program that computes functional enrichment within a pangenome.
misc-data-layers pan-db genomes-storage-db functions
This program computes functional enrichment within a pangenome and returns a functional-enrichment-txt file.
For its sister programs, see anvi-compute-metabolic-enrichment and anvi-compute-functional-enrichment-across-genomes.
Please also see anvi-display-functions, which can both calculate functional enrichment and give you an interactive interface to display the distribution of functions.
For this to run, you must provide a pan-db and genomes-storage-db pair, as well as a misc-data-layers variable that associates genomes in your pan database with categorical data. The program will then find functions that are enriched in each group (i.e., functions that are associated with gene clusters that are characteristic of the genomes in that group).
Note that your genomes-storage-db must have at least one functional annotation source for this to work.
This analysis will help you identify functions that are associated with a specific group of genomes in a pangenome and determine the functional core of your pangenome. For example, in the Prochlorococcus pangenome (the one used in the pangenomics tutorial, where you can find more info about this program), this program finds that Exonuclease VII is enriched in the low-light genomes and not in high-light genomes. The output file provides various statistics about how confident the program is in making this association.
What this program does can be broken down into three steps:
Determine groups of genomes. The program uses a misc-data-layers variable (containing categorical, not numerical, data) to split genomes in a pangenome into two or more groups. For example, in the pangenome tutorial, the categorical variable name was light, which partitioned genomes into low-light and high-light groups.
Determine the “functional associations” of gene clusters. In short, this is collecting the functional annotations for all of the genes in each cluster and assigning the one that appears most frequently to represent the entire cluster.
Quantify the distribution of functions in each group of genomes. For this, the program determines to what extent a particular function is enriched in specific groups of genomes and reports it as a functional-enrichment-txt file. It does so by running the script anvi-script-enrichment-stats, which was implemented by Amy Willis and first described in Shaiber et al. 2020.
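The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anvi'o's implementation: the function names and toy data are hypothetical, ties in the majority vote are broken by first appearance (how anvi'o resolves ties is not described here), and the last step only tallies the fraction of genomes per group that carry each function. The statistical testing itself is done by anvi-script-enrichment-stats and is not reproduced.

```python
from collections import Counter, defaultdict

# Hypothetical toy inputs (illustrative only).
# Step 1 input: genome -> value of the categorical variable (e.g. "light").
genome_groups = {"G1": "LL", "G2": "LL", "G3": "HL", "G4": "HL"}

# Step 2 input: gene cluster -> one (genome, annotation) pair per gene.
gene_clusters = {
    "GC_1": [("G1", "Exonuclease VII"),
             ("G2", "Exonuclease VII"),
             ("G2", "Hypothetical protein")],
    "GC_2": [("G1", "DNA polymerase"), ("G2", "DNA polymerase"),
             ("G3", "DNA polymerase"), ("G4", "DNA polymerase")],
}


def split_into_groups(genome_groups):
    """Step 1: collect the genomes that belong to each category value."""
    groups = defaultdict(set)
    for genome, group in genome_groups.items():
        groups[group].add(genome)
    return dict(groups)


def representative_function(genes):
    """Step 2: pick the annotation that appears most often in a cluster.

    Assumption: ties go to the annotation encountered first.
    """
    counts = Counter(function for _, function in genes)
    return counts.most_common(1)[0][0]


def fraction_of_genomes_with_function(gene_clusters, groups):
    """Step 3 (tallies only): per function, the fraction of genomes in
    each group that carry a gene cluster associated with that function."""
    genomes_with_function = defaultdict(set)
    for genes in gene_clusters.values():
        function = representative_function(genes)
        for genome, _ in genes:
            genomes_with_function[function].add(genome)
    return {
        function: {
            group: len(genomes & members) / len(members)
            for group, members in groups.items()
        }
        for function, genomes in genomes_with_function.items()
    }


groups = split_into_groups(genome_groups)
fractions = fraction_of_genomes_with_function(gene_clusters, groups)
for function, per_group in fractions.items():
    print(function, per_group)
```

In this toy data set, one function is carried by every genome in one group and by none in the other, while the second function is present in all genomes, which is the kind of contrast the enrichment statistics are meant to quantify.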
Check out Alon’s behind the scenes post, which goes into a lot more detail.
Here is the simplest way to run this program:
anvi-compute-functional-enrichment-in-pan -p pan-db \
                                          -g genomes-storage-db \
                                          -o functional-enrichment-txt \
                                          --category-variable CATEGORY \
                                          --annotation-source FUNCTION_SOURCE
The pan-db must contain at least one categorical data layer in misc-data-layers, and you must choose one of these categories to define your pan-groups with the --category-variable parameter. You can see the available variables by running the anvi-show-misc-data program with the parameters -t layers --debug.
Note that by default any genomes not in a category will be ignored; you can instead include these in the analysis by using the flag
The genomes-storage-db must have at least one functional annotation source, and you must choose one of these sources with the --annotation-source parameter. If you do not know which functional annotation sources are available in your genomes-storage-db, you can use the --list-annotation-sources parameter to find out.
By default, gene clusters with the same functional annotation will be merged. But if you provide the --include-gc-identity-as-function parameter and set the annotation source to be ‘IDENTITY’, anvi’o will treat gene cluster names as functions and enable you to investigate enrichment of each gene cluster independently. This is how you do it:
anvi-compute-functional-enrichment-in-pan -p pan-db \
                                          -g genomes-storage-db \
                                          -o functional-enrichment-txt \
                                          --category-variable CATEGORY \
                                          --annotation-source IDENTITY \
                                          --include-gc-identity-as-function
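To make the difference between the default merging behavior and the IDENTITY mode concrete, here is a minimal Python sketch. It is not anvi'o code: the function name, the use_identity argument, and the toy data are hypothetical and only illustrate how the unit of analysis changes.

```python
from collections import defaultdict


def genomes_per_function(cluster_function, cluster_genomes, use_identity=False):
    """Map each 'function' to the set of genomes in which it occurs.

    use_identity=False: gene clusters sharing an annotation are merged.
    use_identity=True:  each gene cluster name is treated as its own function.
    """
    result = defaultdict(set)
    for cluster, genomes in cluster_genomes.items():
        key = cluster if use_identity else cluster_function[cluster]
        result[key] |= set(genomes)
    return dict(result)


# Hypothetical toy data: two clusters share the same annotation.
cluster_function = {
    "GC_1": "Exonuclease VII",
    "GC_2": "Exonuclease VII",
    "GC_3": "DNA polymerase",
}
cluster_genomes = {
    "GC_1": ["G1"],
    "GC_2": ["G2"],
    "GC_3": ["G1", "G2", "G3"],
}

merged = genomes_per_function(cluster_function, cluster_genomes)
by_cluster = genomes_per_function(cluster_function, cluster_genomes,
                                  use_identity=True)
print(merged)
print(by_cluster)
```

With merging, the two clusters annotated with the same function count as a single function found in both genomes; in identity mode each cluster is evaluated on its own.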
To output a functional occurrence table, which describes the number of times each of your functional associations occurs in each genome you’re looking at, use the --functional-occurrence-table-output parameter, like so:
anvi-compute-functional-enrichment-in-pan -p pan-db \
                                          -g genomes-storage-db \
                                          -o functional-enrichment-txt \
                                          --category-variable CATEGORY \
                                          --annotation-source FUNCTION_SOURCE \
                                          --functional-occurrence-table-output FUNC_OCCURRENCE.TXT
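The idea of a functional occurrence table (functions as rows, genomes as columns, counts as values) can be sketched in Python as follows. This is an illustration only: the function names, the column layout, and the choice to count one occurrence per gene are assumptions, and the actual file written by the program may be organized differently.

```python
from collections import Counter


def functional_occurrence_table(cluster_function, cluster_genes):
    """Count how often each function occurs in each genome.

    cluster_genes maps a gene cluster to a list of genome names, with one
    entry per gene (assumption: every gene counts as one occurrence).
    """
    table = {}
    for cluster, genomes in cluster_genes.items():
        function = cluster_function[cluster]
        table.setdefault(function, Counter()).update(genomes)
    return table


def to_tsv(table, genomes):
    """Render the table as tab-separated text with one row per function."""
    lines = ["\t".join(["function"] + genomes)]
    for function in sorted(table):
        counts = table[function]
        lines.append("\t".join([function] + [str(counts[g]) for g in genomes]))
    return "\n".join(lines)


# Hypothetical toy data.
cluster_function = {"GC_1": "Exonuclease VII", "GC_2": "DNA polymerase"}
cluster_genes = {
    "GC_1": ["G1", "G2", "G2"],
    "GC_2": ["G1", "G2", "G3"],
}

table = functional_occurrence_table(cluster_function, cluster_genes)
tsv = to_tsv(table, ["G1", "G2", "G3"])
print(tsv)
```

Genomes that lack a function show up with a count of zero, so every row has a value for every genome.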
A description of the enrichment script run by this program can be found in Shaiber et al 2020
An example of pangenome functional enrichment in the context of the Prochlorococcus metapangenome from Delmont and Eren 2018 is included in the pangenomics tutorial