Export sequences from sequence sources and compute a similarity metric (e.g. ANI). If a Pan Database is given, anvi'o will write the computed output to the misc data tables of that Pan Database.
external-genomes internal-genomes pan-db
This program uses the user’s similarity metric of choice to calculate the similarity between the input genomes.
The program used to calculate the similarity metric is chosen with --program. The programs covered on this page are pyANI, fastANI, and sourmash.
The expected input is any combination of external-genomes, internal-genomes, and text files that contain paths to the fasta files describing each of your genomes. Each such text file is tab-delimited with two columns (name and path), where each path points to a fasta file that is assumed to be a single genome.
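As a minimal sketch of the tab-delimited text file described above, the following Python snippet writes one with a name column and a path column. The helper name `write_fasta_txt`, the genome names, the FASTA paths, and the output file name are hypothetical placeholders, and the header row is an assumption based on the two column names given above.

```python
import csv
import os
import tempfile

def write_fasta_txt(genomes, out_path):
    """Write a two-column, tab-delimited file (name, path), one genome per row."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["name", "path"])  # assumed header row
        for name, fasta_path in genomes:
            writer.writerow([name, fasta_path])

# Hypothetical genome names and FASTA paths, for illustration only.
genomes = [
    ("Genome_A", "genomes/genome_a.fa"),
    ("Genome_B", "genomes/genome_b.fa"),
]

# Written to a temporary directory so the sketch does not touch your project files.
out_path = os.path.join(tempfile.mkdtemp(), "fasta.txt")
write_fasta_txt(genomes, out_path)
```

Each row names one genome and points to the single FASTA file that holds it.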
The program outputs a directory with genome-similarity data. The specific contents depend on how similarity scores are computed (specified with --program), but the directory generally contains tab-separated files of similarity scores between genomes and related metrics.
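To illustrate working with a tab-separated file of pairwise scores, here is a small Python sketch that parses a square matrix into nested dictionaries. The layout (a header row of genome names, with the first column holding row names) and the values are assumptions made for illustration. The actual file names and columns depend on the --program you choose, so check the files in your own output directory.

```python
import csv
import io

def read_similarity_matrix(handle):
    """Parse a square, tab-separated matrix into {row_name: {column_name: float}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    column_names = header[1:]  # the first header cell labels the row-name column
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = {
            column: float(value) for column, value in zip(column_names, row[1:])
        }
    return matrix

# Toy values for illustration only; this is not real program output.
toy_tsv = (
    "key\tGenome_A\tGenome_B\n"
    "Genome_A\t1.0\t0.97\n"
    "Genome_B\t0.97\t1.0\n"
)
matrix = read_similarity_matrix(io.StringIO(toy_tsv))
print(matrix["Genome_A"]["Genome_B"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` object.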
You also have the option to provide a pan-db, in which case the output data will additionally be stored in the database as misc-data-layers and misc-data-layer-orders data. This was done in the pangenomic tutorial.
Here is an example run with pyANI from an external-genomes without any parameter changes:
anvi-compute-genome-similarity -e external-genomes \
                               -o path/for/genome-similarity \
                               --program pyANI
Parameters have been divided up based on which --program you use.
If you use pyANI, you have the option to change any of the following parameters:
The minimum alignment fraction (percent identity scores for genome pairs whose alignment fraction is lower than this will be set to 0). The default is 0.
If you want to keep alignments that are long despite not passing the minimum alignment fraction filter, you can supply a --significant-alignment-length to override --min-alignment-fraction.
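The interplay between the two filters above can be sketched in a few lines of Python. This is an illustration of the logic as described on this page, not anvi'o's implementation: the function name, its argument names, and the choice of inclusive comparisons (`>=`) are assumptions.

```python
def filter_percent_identity(percent_identity, alignment_fraction, alignment_length,
                            min_alignment_fraction=0.0,
                            significant_alignment_length=None):
    """Return the score unchanged if it survives the filters, otherwise 0.0."""
    # Assumption: a pair passes when its fraction is at least the minimum.
    passes_fraction = alignment_fraction >= min_alignment_fraction
    # Assumption: a long enough alignment overrides the fraction filter.
    long_enough = (
        significant_alignment_length is not None
        and alignment_length >= significant_alignment_length
    )
    return percent_identity if (passes_fraction or long_enough) else 0.0

# Toy numbers: only 5% of the genome aligned, with a minimum fraction of 25%.
dropped = filter_percent_identity(0.98, 0.05, 2000, min_alignment_fraction=0.25)

# Same pair, but the alignment is long enough to be considered significant.
kept = filter_percent_identity(0.98, 0.05, 20000, min_alignment_fraction=0.25,
                               significant_alignment_length=10000)
```

With the default minimum alignment fraction of 0, every pair passes and no score is zeroed.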
You can change any of the following fastANI parameters:
The kmer size. The default is 16.
The fragment length. The default is 30.
The minimum number of fragments for a result to count. The default is 50.
You have the option to change the kmer-size. This value should depend on the relationship between your samples. The default is 31 (as recommended by sourmash for genus-level distances), but we found that 13 most closely parallels the results from an ANI alignment.
You can also set the compression ratio for your fasta files. Decreasing this from the default (1000) will decrease sensitivity.
Once calculated, the similarity matrix is used to create dendrograms via hierarchical clustering, which are stored in the output directory (and in the pan-db, if provided). You can choose to change the distance metric or linkage algorithm used for this clustering.
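To make the clustering step more concrete, here is a deliberately simple Python sketch that turns a similarity matrix into a tree topology in Newick format. It uses average linkage on a distance defined as 1 minus similarity. That combination is chosen only to keep the example short and is not necessarily the distance metric or linkage anvi'o uses; the function name and the toy scores are likewise hypothetical.

```python
def average_linkage_newick(names, similarity):
    """Agglomerative clustering (average linkage, distance = 1 - similarity).

    Returns the resulting topology as a Newick string without branch lengths.
    """
    # Each cluster is a pair: (Newick label, list of member genome names).
    clusters = [(name, [name]) for name in names]

    def distance(a, b):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(x, y) for x in a[1] for y in b[1]]
        return sum(1.0 - similarity[x][y] for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two closest clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            "(%s,%s)" % (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    return clusters[0][0] + ";"

# Toy similarity scores for illustration only.
similarity = {
    "A": {"A": 1.00, "B": 0.99, "C": 0.80},
    "B": {"A": 0.99, "B": 1.00, "C": 0.82},
    "C": {"A": 0.80, "B": 0.82, "C": 1.00},
}
tree = average_linkage_newick(["A", "B", "C"], similarity)
print(tree)  # (C,(A,B));
```

The two most similar genomes, A and B, are joined first, and C is attached to that pair afterwards.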
If you’re getting a lot of debug/output messages, you can turn them off with --just-do-it or helpfully store them into a file with --log-file.