anvi-pan-genome

An anvi'o program to compute a pangenome from an anvi'o genome storage.

Authors

A. Murat Eren (Meren)

Can consume

genomes-storage-db gene-clusters-txt

Can provide

pan-db misc-data-items-order gene-clusters

Usage

This program implements pangenomics, and organizes genes found within a genomes-storage-db to create a pan-db.

Please first read the pangenomics tutorial to have a better understanding of the steps that lead to the generation of a pan-db.

Making sure your installation can do pangenomics

You can always test whether your computer has all the dependencies for a successful pangenomics analysis by running:

anvi-self-test --suite pangenomics

If it runs without errors, you’re golden. If not, please consult the most up-to-date installation instructions for anvi’o and get in touch with the anvi’o community for guidance.

A brief summary

The program anvi-pan-genome does three major things for its user:

1. Calculates the similarity between all gene amino acid sequences found in the genomes described in your genomes-storage-db using DIAMOND. You have some options here: (1) you can use NCBI’s BLAST program blastp instead of DIAMOND with the --use-ncbi-blast flag, (2) instead of analyzing all genomes you can focus on a subset with the --genome-names parameter, and (3) you can exclude partial gene calls from your analysis with the --exclude-partial-gene-calls flag if you think you must.
2. Resolves gene clusters from the search results via the MCL algorithm, after discarding weak hits using the --minbit heuristic (inspired by the workflow implemented in ITEP (Benedict et al., 2014)).
3. Performs additional analyses of gene clusters for downstream analyses and visualization tasks. These analyses include,
   - Multiple sequence alignment of amino acid sequences in each gene cluster,
   - Computation of functional and geometric homogeneity indices,
   - Computation of average amino-acid identity (AAI) within each gene cluster,
   - Hierarchical clustering analysis of gene clusters based on their distribution across genomes, and of genomes based on their sharing of the gene pool.

The basic command line to run a pangenomic analysis that performs all the steps above looks like this:

anvi-pan-genome --genomes-storage genomes-storage-db \
                --project-name PROJECT_NAME
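The options listed under the first step can be combined with the basic command. The following is a minimal sketch rather than a recommended configuration: GENOMES.db, PROJECT_NAME, and the genome names are hypothetical placeholders, and it assumes that --genome-names accepts a comma-separated list of names (please confirm with anvi-pan-genome --help on your installation). The guard around the call is only there so the snippet does not fail on a machine without anvi’o.

```shell
# Hypothetical placeholders: replace with your own storage file and genome names.
GENOMES_DB="GENOMES.db"
# Assumption: --genome-names takes a comma-separated list of genome names.
GENOME_NAMES="Genome_01,Genome_02,Genome_03"

# Run the analysis only if anvi'o is available in the current environment.
if command -v anvi-pan-genome >/dev/null 2>&1; then
    anvi-pan-genome --genomes-storage "$GENOMES_DB" \
                    --project-name PROJECT_NAME \
                    --genome-names "$GENOME_NAMES" \
                    --use-ncbi-blast \
                    --exclude-partial-gene-calls
else
    echo "anvi-pan-genome not found; activate your anvi'o environment first" >&2
fi
```

Dropping --use-ncbi-blast from this sketch would fall back to DIAMOND, which is the default search program.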

Power users can also initiate the anvi’o pangenomics workflow with user-defined gene clusters, which makes it possible to visualize in anvi’o the pangenomic analyses performed by other tools. In this case, only the third step is performed, using the already established gene clusters:

anvi-pan-genome --genomes-storage genomes-storage-db \
                --project-name PROJECT_NAME \
                --gene-clusters-txt gene-clusters-txt
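As an illustration of what a user-defined gene clusters file might look like, the sketch below writes a tiny TAB-delimited file and checks its shape. The column names (genome_name, gene_caller_id, gene_cluster_name) and the header line are an assumption about the gene-clusters-txt format, and every genome name, gene caller id, and cluster name is made up; please check the gene-clusters-txt documentation before preparing a real file.

```shell
# Assumed layout: TAB-delimited, one gene per row, with a header line.
# All values below are hypothetical.
printf 'genome_name\tgene_caller_id\tgene_cluster_name\n' >  gene-clusters.txt
printf 'Genome_01\t0\tGC_01\n'                             >> gene-clusters.txt
printf 'Genome_02\t14\tGC_01\n'                            >> gene-clusters.txt
printf 'Genome_01\t1\tGC_02\n'                             >> gene-clusters.txt

# Sanity check: count rows that do not have exactly three TAB-separated fields.
BAD_ROWS=$(awk -F'\t' 'NF != 3' gene-clusters.txt | wc -l | tr -d ' ')
echo "rows with a wrong number of columns: $BAD_ROWS"
```

The genome names in such a file would need to match the genomes described in the genomes-storage-db that is passed to the same command.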

The ‘additional parameters’ mechanism for power users

At the core of the pangenomics workflow lies the reciprocal BLAST search that identifies sequence similarities within a pool of gene sequences. For this, anvi’o uses DIAMOND by default, but the user can change the search algorithm. Depending on the algorithm used for this step, the matching anvi’o driver sets some default parameters for a successful run, such as the parameter that explicitly defines where the output files generated by the search program should go. Apart from those mandatory parameters, anvi’o allows the user to define a set of additional parameters to pass to the search algorithm.

This is done via the flag --additional-params-for-seq-search. For instance, the user could take a look at the parameters DIAMOND offers by typing diamond help on their terminal. They may then decide to use the --sensitive flag implemented by DIAMOND to enable a slower but more sensitive search, and the parameter --id 98 to ask DIAMOND not to report any hits between genes that share less than 98% sequence identity. This limits gene clusters to only those sequences that are extremely closely related, while pushing everything else to be singletons (which can also be removed from the analysis with a separate --min-occurrence 2 flag that anvi-pan-genome accepts). They can pass these parameters to DIAMOND by running their analysis the following way:

anvi-pan-genome --genomes-storage genomes-storage-db \
                --project-name PROJECT_NAME \
                --additional-params-for-seq-search "--masking 0 --sensitive --id 98"
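To also drop the singletons that such a strict identity cutoff produces, the --min-occurrence flag mentioned above could be added to the same command. This is a sketch under assumptions: GENOMES.db and PROJECT_NAME are hypothetical placeholders, and the search parameters are kept in a shell variable only to make the quoting explicit, since the whole string has to reach anvi-pan-genome as a single argument.

```shell
# Hypothetical placeholder for the genomes storage file.
GENOMES_DB="GENOMES.db"
# Parameters forwarded to DIAMOND; quoted below so they stay one argument.
SEARCH_PARAMS="--masking 0 --sensitive --id 98"

# Run the analysis only if anvi'o is available in the current environment.
if command -v anvi-pan-genome >/dev/null 2>&1; then
    anvi-pan-genome --genomes-storage "$GENOMES_DB" \
                    --project-name PROJECT_NAME \
                    --additional-params-for-seq-search "$SEARCH_PARAMS" \
                    --min-occurrence 2
else
    echo "anvi-pan-genome not found; activate your anvi'o environment first" >&2
fi
```

Note that the quotes must be plain ASCII double quotes; typographic quotes copied from a web page will not be understood by the shell.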

The additional parameters used for the search will be stored in the resulting pan-db and can be viewed anytime using the program anvi-display-pan.

For DIAMOND, if no additional parameters are declared, anvi’o will include --masking 0 by default, since we recently learned that not using that flag leads to the elimination of genes with many repeated elements (see #1955).

With the freedom of additional parameters for sequence search, it is possible to make significant mistakes, since anvi’o will have no opportunity to sanity-check user-defined additional parameters. If you are doing something experimental, please keep an eye on the output messages and error logs.

If the user chooses to use NCBI’s BLAST program, anvi’o will pass the value of the parameter --additional-params-for-seq-search to NCBI’s blastp instead.

Additional Resources

A tutorial on pangenomics