An anvi'o program to compute a pangenome from an anvi'o genome storage.
This program finds, clusters, and organizes the genes within a genomes-storage-db to create a pan-db.
This is the program that does the bulk of the work when running a pangenomic workflow. Check out the pangenomic tutorial for a more in-depth overview of the contents of this page and the capabilities of a pan-db.
Before running this program, you’ll want to make sure your dependencies are all set, since this program requires some additional dependencies beyond base anvi’o. If the following command runs without errors, then you’re all good.
anvi-self-test --suite pangenomics
If that command doesn’t run smoothly, check out this page.
This program finds and organizes your gene clusters to give you all of the data that is displayed when you run anvi-display-pan. Almost all of the work described in this gif that explains the common steps involved in pangenomics is done by this program.
In a little more detail, this program will do three major things for you:
Calculate the similarity between all of the gene calls in all of the genomes in your genomes-storage-db. By default this uses DIAMOND, but Meren strongly recommends that you use the --use-ncbi-blast flag to use blastp instead.
When doing this, the program will look at every genome in your genomes-storage-db (unless you use --genome-names) and will use every gene call, whether or not it is complete (unless you use --exclude-partial-gene-calls).
After doing this, it will use the minbit heuristic (originally from ITEP (Benedict et al., 2014)) to throw out weak matches. This removes a lot of noise before clustering.
Use the MCL algorithm to identify clusters in your search results.
Organize your gene clusters and genomes using their euclidean distance and ward linkage.
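To make the minbit filtering step a bit more concrete, here is a minimal Python sketch. It assumes the usual definition of minbit, the bit score of a hit between genes A and B divided by the smaller of the two self-hit bit scores, and a cutoff of 0.5. The function names, the cutoff handling, and the toy scores are all made up for illustration; this is not anvi'o's actual code.

```python
def minbit(bitscore_ab, self_bitscore_a, self_bitscore_b):
    """Assumed definition: bitscore(A, B) / min(bitscore(A, A), bitscore(B, B))."""
    return bitscore_ab / min(self_bitscore_a, self_bitscore_b)


def filter_weak_hits(hits, self_scores, threshold=0.5):
    """Keep only hits whose minbit value reaches the threshold.

    hits        -- list of (query, target, bitscore) tuples (hypothetical layout)
    self_scores -- dict mapping each gene to the bit score of its self-hit
    threshold   -- cutoff for the minbit value (0.5 is assumed here)
    """
    kept = []
    for query, target, score in hits:
        if query == target:
            continue  # self-hits are only used for normalization
        if minbit(score, self_scores[query], self_scores[target]) >= threshold:
            kept.append((query, target, score))
    return kept


# Toy data: three genes and their pairwise bit scores.
self_scores = {"g1": 200.0, "g2": 180.0, "g3": 400.0}
hits = [
    ("g1", "g2", 150.0),  # 150 / 180 is about 0.83, so this hit is kept
    ("g1", "g3", 60.0),   # 60 / 200 = 0.30, so this hit is dropped
    ("g2", "g3", 95.0),   # 95 / 180 is about 0.53, so this hit is kept
]
kept = filter_weak_hits(hits, self_scores)
```

Only the hits that survive this filter would be passed on to MCL as edges, which is why the heuristic removes so much noise before clustering.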
This program is very smart, and if you’ve already run it, it will try to use the data that it has already calculated. This way you can change smaller parameters without paying the full run time again. However, this also means you need to tell it to rerun the process (if that’s what you want) with the flag --overwrite-output-destinations.
Who doesn’t love a good example? The simplest way to run this is as follows:
anvi-pan-genome -g genomes-storage-db
But there are many parameters you can alter to your liking. For example, here’s a run that uses NCBI’s blastp to find sequence similarities, uses muscle to align genes, and defines its output:
anvi-pan-genome -g genomes-storage-db \
                --align-with muscle \
                --use-ncbi-blast \
                -n MY_PROJECT_NAME \
                --description description.txt \
                -o PATH/TO/pan-db
Here’s another example that only looks at the complete gene calls within a subset of the genomes, eliminates gene clusters that only have hits in a single genome, and uses DIAMOND but with the sensitive setting enabled:
anvi-pan-genome -g genomes-storage-db \
                -n MY_PROJECT_NAME \
                --genome-names GENOME_1,GENOME_2,GENOME_3 \
                --exclude-partial-gene-calls \
                --min-occurrence 2 \
                --sensitive \
                -o PATH/TO/pan-db
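If you launch runs like these from a script, it can help to assemble the command as a list of arguments rather than a single string. The sketch below is only an illustration: the helper name build_pan_genome_command and its keyword arguments are made up here, and it uses only flags that appear on this page.

```python
import shlex


def build_pan_genome_command(genomes_storage, project_name, output_dir,
                             use_ncbi_blast=False, align_with=None,
                             overwrite=False):
    """Assemble an anvi-pan-genome invocation as a list of arguments.

    This is a hypothetical helper, not part of anvi'o.
    """
    cmd = ["anvi-pan-genome",
           "-g", genomes_storage,
           "-n", project_name,
           "-o", output_dir]
    if use_ncbi_blast:
        cmd.append("--use-ncbi-blast")
    if align_with:
        cmd += ["--align-with", align_with]
    if overwrite:
        # Ask the program to redo the analysis instead of reusing earlier results.
        cmd.append("--overwrite-output-destinations")
    return cmd


cmd = build_pan_genome_command("genomes-storage-db", "MY_PROJECT_NAME",
                               "PATH/TO/pan-db",
                               use_ncbi_blast=True, align_with="muscle")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) as is, which avoids shell quoting problems when paths contain spaces.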
Some other parameters are available to you as well; for example, -T sets the number of threads to use.