A program that computes functional enrichment within a pangenome.
misc-data-layers pan-db genomes-storage-db functions
This program computes functional enrichment within a pangenome and returns a functional-enrichment-txt file.
For its sister programs, see anvi-compute-metabolic-enrichment and anvi-compute-functional-enrichment-across-genomes.
Please also see anvi-display-functions, which can both calculate functional enrichment and give you an interactive interface to display the distribution of functions.
For this to run, you must provide a pan-db and genomes-storage-db pair, as well as a misc-data-layers that associates genomes in your pan database with categorical data. The program will then find functions that are enriched in each group (i.e., functions that are associated with gene clusters that are characteristic of the genomes in that group).
Note that your genomes-storage-db must have at least one functional annotation source for this to work.
This analysis will help you identify functions that are associated with a specific group of genomes in a pangenome and determine the functional core of your pangenome. For example, in the Prochlorococcus pangenome (the one used in the pangenomics tutorial, where you can find more info about this program), this program finds that Exonuclease VII is enriched in the low-light genomes and not in the high-light genomes. The output file provides various statistics about how confident the program is in making this association.
What this program does can be broken down into three steps:
Determine groups of genomes. The program uses a misc-data-layers variable (containing categorical, not numerical, data) to split genomes in a pangenome into two or more groups. For example, in the pangenome tutorial, the categorical variable was named light, and it partitioned genomes into low-light and high-light groups.
Determine the "functional associations" of gene clusters. In short, this means collecting the functional annotations for all of the genes in each cluster and assigning the one that appears most frequently to represent the entire cluster.
Quantify the distribution of functions in each group of genomes. For this, the program determines to what extent a particular function is enriched in specific groups of genomes and reports it as a functional-enrichment-txt file. It does so by running the script anvi-script-enrichment-stats.
The script anvi-script-enrichment-stats was implemented by Amy Willis and was first described in this paper.
Check out Alon's behind-the-scenes post, which goes into a lot more detail.
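The three steps above can be illustrated with a small Python sketch. It is not anvi'o's implementation: the function names and the toy data are hypothetical, and the sketch stops at the per-group counts that a statistical test would need, since the actual enrichment statistics come from anvi-script-enrichment-stats.

```python
from collections import Counter, defaultdict


def split_genomes_by_category(genome_categories):
    """Step 1: group genome names by the value of a categorical variable."""
    groups = defaultdict(set)
    for genome, category in genome_categories.items():
        groups[category].add(genome)
    return dict(groups)


def most_frequent_annotation(gene_annotations):
    """Step 2: return the annotation carried by the most genes in a cluster."""
    counts = Counter(a for a in gene_annotations if a is not None)
    return counts.most_common(1)[0][0] if counts else None


def genomes_with_function_per_group(groups, cluster_functions, cluster_genomes):
    """Step 3 (counts only): per function, how many genomes in each group carry it."""
    carriers = defaultdict(set)
    for cluster, function in cluster_functions.items():
        if function is not None:
            carriers[function] |= cluster_genomes[cluster]
    return {
        function: {group: len(members & genomes) for group, members in groups.items()}
        for function, genomes in carriers.items()
    }


# Hypothetical toy data: four genomes, two gene clusters.
genome_categories = {"G1": "low-light", "G2": "low-light",
                     "G3": "high-light", "G4": "high-light"}
cluster_annotations = {
    "GC_1": ["Exonuclease VII", "Exonuclease VII", None],
    "GC_2": ["DNA polymerase", "DNA polymerase", "DNA polymerase", "Helicase"],
}
cluster_genomes = {"GC_1": {"G1", "G2"}, "GC_2": {"G1", "G2", "G3", "G4"}}

groups = split_genomes_by_category(genome_categories)
cluster_functions = {gc: most_frequent_annotation(a)
                     for gc, a in cluster_annotations.items()}
counts = genomes_with_function_per_group(groups, cluster_functions, cluster_genomes)
print(counts)
```

In this toy example, Exonuclease VII is carried by both low-light genomes and by neither high-light genome, which is the kind of pattern the enrichment statistics are meant to flag.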
Here is the simplest way to run this program:
anvi-compute-functional-enrichment-in-pan -p pan-db \
    -g genomes-storage-db \
    -o functional-enrichment-txt \
    --category-variable CATEGORY \
    --annotation-source FUNCTION_SOURCE
The pan-db must contain at least one categorical data layer in misc-data-layers, and you must choose one of these categories to define your groups of genomes with the --category-variable parameter. You can see the available variables by running the anvi-show-misc-data program with the parameters -t layers --debug.
Note that by default any genomes not in a category will be ignored; you can instead include these in the analysis by using the flag --include-ungrouped.
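The effect of --include-ungrouped can be pictured with the following Python sketch. It only illustrates the behavior described above; the function name and the UNGROUPED label are hypothetical, and anvi'o's own handling may differ in its details.

```python
def assign_groups(genomes, genome_categories, include_ungrouped=False,
                  ungrouped_label="UNGROUPED"):
    """Group genomes by category; genomes without one are skipped unless requested."""
    groups = {}
    for genome in genomes:
        category = genome_categories.get(genome)
        if category is None:
            if not include_ungrouped:
                continue  # default behavior: ignore genomes without a category
            category = ungrouped_label  # hypothetical label for the extra group
        groups.setdefault(category, set()).add(genome)
    return groups


# Hypothetical toy data: G3 has no value for the categorical variable.
genomes = ["G1", "G2", "G3"]
genome_categories = {"G1": "low-light", "G2": "high-light"}

default_groups = assign_groups(genomes, genome_categories)
with_ungrouped = assign_groups(genomes, genome_categories, include_ungrouped=True)
print(default_groups)
print(with_ungrouped)
```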
The genomes-storage-db must have at least one functional annotation source, and you must choose one of these sources with the --annotation-source parameter. If you do not know which functional annotation sources are available in your genomes-storage-db, you can use the --list-annotation-sources parameter to find out.
By default, gene clusters with the same functional annotation will be merged. But if you provide the --include-gc-identity-as-function parameter and set the annotation source to be "IDENTITY", anvi'o will treat gene cluster names as functions and enable you to investigate enrichment of each gene cluster independently. This is how you do it:
anvi-compute-functional-enrichment-in-pan -p pan-db \
    -g genomes-storage-db \
    -o functional-enrichment-txt \
    --category-variable CATEGORY \
    --annotation-source IDENTITY \
    --include-gc-identity-as-function
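The difference between merging gene clusters by function and treating each gene cluster as its own function can be sketched in Python as follows. This is a conceptual illustration with hypothetical names and toy data, not anvi'o code.

```python
from collections import defaultdict


def genomes_per_function(cluster_functions, cluster_genomes, use_identity=False):
    """Map each function (or each gene cluster name) to the genomes that carry it."""
    merged = defaultdict(set)
    for cluster, genomes in cluster_genomes.items():
        # With use_identity, the gene cluster name itself plays the role of the function.
        key = cluster if use_identity else cluster_functions.get(cluster)
        if key is None:
            continue  # clusters without an annotation are skipped in this sketch
        merged[key] |= set(genomes)
    return dict(merged)


# Hypothetical toy data: GC_1 and GC_2 share the same functional annotation.
cluster_functions = {"GC_1": "ABC transporter",
                     "GC_2": "ABC transporter",
                     "GC_3": "Exonuclease VII"}
cluster_genomes = {"GC_1": {"G1"}, "GC_2": {"G2", "G3"}, "GC_3": {"G1", "G2"}}

merged = genomes_per_function(cluster_functions, cluster_genomes)
by_identity = genomes_per_function(cluster_functions, cluster_genomes, use_identity=True)
print(merged)
print(by_identity)
```

With merging, the two ABC transporter clusters collapse into one entry that spans G1, G2, and G3. With identity, each gene cluster keeps its own genome set and can be tested separately.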
To output a functional occurrence table, which describes the number of times each of your functional associations occurs in each genome you're looking at, use the --functional-occurrence-table-output parameter, like so:
anvi-compute-functional-enrichment-in-pan -p pan-db \
    -g genomes-storage-db \
    -o functional-enrichment-txt \
    --category-variable CATEGORY \
    --annotation-source FUNCTION_SOURCE \
    --functional-occurrence-table-output FUNC_OCCURRENCE.TXT
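To give a feel for what a functional occurrence table contains, here is a minimal Python sketch that builds one from toy data. It rests on loud assumptions: the counting rule (one count per gene cluster carrying the function in a genome), the column layout, and all names are hypothetical, and the actual file written by anvi'o may be organized differently.

```python
import csv
import io
from collections import defaultdict


def occurrence_table(cluster_functions, cluster_genomes):
    """Count, per function and genome, the gene clusters that carry the function."""
    table = defaultdict(lambda: defaultdict(int))
    for cluster, function in cluster_functions.items():
        if function is None:
            continue  # unannotated clusters contribute nothing
        for genome in cluster_genomes.get(cluster, ()):
            table[function][genome] += 1
    return table


def to_tsv(table, genomes):
    """Render the table as tab-delimited text, one row per function."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["function"] + genomes)
    for function in sorted(table):
        writer.writerow([function] + [table[function].get(g, 0) for g in genomes])
    return buffer.getvalue()


# Hypothetical toy data: two clusters share the ABC transporter annotation.
cluster_functions = {"GC_1": "Exonuclease VII",
                     "GC_2": "ABC transporter",
                     "GC_3": "ABC transporter"}
cluster_genomes = {"GC_1": ["G1", "G2"], "GC_2": ["G1", "G3"], "GC_3": ["G1"]}

table = occurrence_table(cluster_functions, cluster_genomes)
tsv = to_tsv(table, ["G1", "G2", "G3"])
print(tsv)
```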