Do cool stuff with gene clusters in anvi'o pan genomes.
genes-fasta concatenated-gene-alignment-fasta misc-data-items
This aptly-named program gets the sequences for the gene clusters stored in a pan-db and returns them as either a genes-fasta or a concatenated-gene-alignment-fasta (which you can use to run anvi-gen-phylogenomic-tree). This gives you advanced access to your gene clusters, which you can take out of anvi’o, use for phylogenomic analyses, or do whatever you please with.
You also have the option to output the sequences of your choice as a misc-data-items (with --add-into-items-additional-data-table), which can be added to the interactive interface as additional layers.
While the number of parameters may seem daunting, many of the options just help you specify exactly which gene clusters you want to get the sequences from.
Here is a basic run that will export alignments for every single gene cluster found in the pan-db as amino acid sequences:
anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta
To get the DNA sequences instead, just add --report-DNA-sequences.
You can export only the sequences for a specific collection or bin with the parameters -C or -b, respectively. You also have the option to display the collections and bins available in your pan-db with --list-collections or --list-bins.
anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta \
                                     -C collection
Alternatively, you can export specific gene clusters by name, either by providing a single gene cluster ID or a file with one gene cluster ID per line. For example:
anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta \
                                     --gene-cluster-ids-file gene_clusters.txt
where gene_clusters.txt contains the following:
GC_00000618
GC_00000643
GC_00000729
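As a minimal sketch, such a file can be written and sanity-checked from the shell before handing it to the program. The IDs are the ones from the example above; the check only looks for empty lines or lines containing whitespace, and whether anvi'o itself tolerates such lines is not something this sketch assumes either way.

```shell
# Write one gene cluster ID per line (IDs taken from the example above).
printf '%s\n' GC_00000618 GC_00000643 GC_00000729 > gene_clusters.txt

# Count lines that are empty or contain whitespace, i.e. lines that are
# not a clean "one ID per line" entry. grep exits non-zero when the count
# is 0, so "|| true" keeps the script going under "set -e".
bad_lines=$(grep -c -v -E '^[^[:space:]]+$' gene_clusters.txt || true)

echo "IDs in file: $(wc -l < gene_clusters.txt | tr -d ' ')"
echo "suspicious lines: $bad_lines"
```

If the second number is not zero, it is worth opening the file and cleaning it up before running the export.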
The filtering parameters described here exclude gene clusters that don’t reach certain thresholds and are applied on top of any selection already made (for example, you can use them to exclude clusters within a specific bin).
Several filters let you exclude some subsection of your gene clusters. For instance, you can restrict the output to single-copy gene clusters by setting --max-num-genes-from-each-genome to 1. The following run on a genomes-storage-db that contains 50 genomes combines this with two other filters to report only the single-copy core genes with a functional homogeneity index above 0.25:
anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta \
                                     --max-num-genes-from-each-genome 1 \
                                     --min-num-genomes-gene-cluster-occurs 50 \
                                     --min-functional-homogeneity-index 0.25
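The "core" part of that selection depends on the occurrence threshold matching the number of genomes in the pangenome. A small sketch that keeps the two in sync through a shell variable is shown here; the file names (GENOMES.db, PAN.db, single_copy_core.fa) are placeholders assumed for illustration, and the command is only printed, not executed.

```shell
# Assumed placeholders: replace with the paths to your own files.
genomes_storage="GENOMES.db"
pan_db="PAN.db"
output="single_copy_core.fa"

# Number of genomes in the pangenome (50 in the example above).
num_genomes=50

# Requiring a cluster to occur in every genome, with at most one gene per
# genome, is what restricts the output to single-copy core gene clusters.
set -- anvi-get-sequences-for-gene-clusters \
    -g "$genomes_storage" \
    -p "$pan_db" \
    -o "$output" \
    --max-num-genes-from-each-genome 1 \
    --min-num-genomes-gene-cluster-occurs "$num_genomes"

# Dry run: print the command for review.
# To execute it instead, replace the line below with: "$@"
echo "$@"
```

Printing first makes it easy to confirm that the threshold and paths are what you intended before starting a run.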
You can also exclude genomes that are missing some number of the gene clusters that you’re working with by using the parameter --max-num-gene-clusters-missing-from-genome.
For each of these parameters, see the program’s help menu for more information.
To get a concatenated-gene-alignment-fasta (which you can use to run anvi-gen-phylogenomic-tree), use the parameter --concatenate-gene-clusters:
anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta \
                                     --concatenate-gene-clusters
Here, you also have the option to choose a specific aligner (or list the available aligners), as well as provide a NEXUS-formatted partition file, if you so choose.
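Since a concatenated alignment is meant for phylogenomics, every sequence in it should have the same length. The following is a minimal sketch of a sanity check you could run on the output before building a tree, assuming it is a plain FASTA file. The function names and the file name concatenated.fa are hypothetical and not part of anvi'o.

```shell
# Print "name<TAB>length" for every record in a FASTA file,
# summing over wrapped sequence lines.
alignment_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($0, 2); len = 0; next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Print how many distinct sequence lengths the file contains.
# A well-formed alignment should yield exactly 1.
distinct_lengths() {
    alignment_lengths "$1" | cut -f2 | sort -u | wc -l | tr -d ' '
}

# Example usage (concatenated.fa is a hypothetical output file name):
#   distinct_lengths concatenated.fa
```

A result other than 1 suggests that the file is not a single consistent alignment and is worth inspecting with alignment_lengths.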