anvi-get-sequences-for-gene-clusters [program]

Do cool stuff with gene clusters in anvi'o pan genomes.

Can provide

genes-fasta concatenated-gene-alignment-fasta misc-data-items

Can consume

pan-db genomes-storage-db

Usage

This aptly-named program gets the sequences for the gene clusters stored in a pan-db and returns them as either a genes-fasta or a concatenated-gene-alignment-fasta (which you can use to run anvi-gen-phylogenomic-tree). This gives you advanced access to your gene clusters, which you can take out of anvi’o, use for phylogenomic analyses, or do whatever you please with.

You can also output the sequences of your choice as misc-data-items (with --add-into-items-additional-data-table), which can then be added to the interactive interface as additional layers.

While the number of parameters may seem daunting, many of the options just help you specify exactly which gene clusters you want to get the sequences from.

Running on all gene clusters

Here is a basic run that exports alignments for every gene cluster found in the pan-db as amino acid sequences:

anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta

To get the DNA sequences instead, just add --report-DNA-sequences.
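After a run like the one above, a quick sanity check is to count how many sequences ended up in the output. The sketch below assumes only that the output is an ordinary FASTA file in which every record starts with a ">" header line; the function name count_fasta_records is a hypothetical helper, not part of anvi'o.

```shell
# Hypothetical helper (not an anvi'o command): count FASTA records in a file.
# Assumes each record begins with a header line starting with ">".
count_fasta_records() {
    # grep -c prints the number of matching lines
    grep -c '^>' "$1"
}
```

For example, count_fasta_records genes.fa would print the number of sequences in a file called genes.fa (an illustrative file name).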

Exporting only specific gene clusters

Part 1: Choosing gene clusters by collection, bin, or name

You can export only the sequences for a specific collection or bin with the parameters -C or -b, respectively. You can also display the collections and bins available in your pan-db with --list-collections or --list-bins.

anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta \
                                     -C collection

Alternatively, you can export specific gene clusters by name, either by providing a single gene cluster ID or a file with one gene cluster ID per line. For example:

anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta \
                                     --gene-cluster-ids-file gene_clusters.txt

where gene_clusters.txt contains the following:

GC_00000618
GC_00000643
GC_00000729
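Since the file must hold exactly one gene cluster ID per line, it can help to check it before starting a run. The sketch below is a hypothetical helper, not an anvi'o feature. It assumes that IDs always look like "GC_" followed by digits, a pattern inferred only from the example above, so adjust it if your IDs differ.

```shell
# Hypothetical helper (not an anvi'o command): succeed only if the file is
# non-empty and every line is "GC_" followed by digits. The ID pattern is an
# assumption based on the example IDs shown above.
check_gene_cluster_ids() {
    # grep -v selects lines that do NOT match; -q suppresses output.
    # The leading "!" inverts the result: success means no bad lines exist.
    [ -s "$1" ] && ! grep -qvE '^GC_[0-9]+$' "$1"
}
```

Blank lines, trailing spaces, and Windows line endings all make the check fail, which is usually what you want to catch before a long run.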

Part 2: Choosing gene clusters by their attributes

These parameters exclude gene clusters that don’t meet certain thresholds. They are applied on top of any filters already in effect (for example, you can use them to exclude clusters within a specific bin).

Here is a list of the different filters that you can use to exclude some subsection of your gene clusters:

  • min/max number of genomes that the gene cluster occurs in.
  • min/max number of genes from each genome. For example, you could exclude clusters that don’t appear at least 3 times in every genome, or get single-copy genes by setting --max-num-genes-from-each-genome to 1.
  • min/max geometric homogeneity index
  • min/max functional homogeneity index
  • min/max combined homogeneity index

For example, the following run on a genomes-storage-db that contains 50 genomes will report only the single-copy core genes with a functional homogeneity index above 0.25:

anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o genes-fasta \
                                     --max-num-genes-from-each-genome 1 \
                                     --min-num-genomes-gene-cluster-occurs 50 \
                                     --min-functional-homogeneity-index 0.25
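The "single-copy core" part of that example comes from two filters working together: at most one gene per genome, and presence in as many genomes as the pangenome contains. The sketch below only prints the corresponding command for a given genome count so you can inspect it before running it. The function name and the file names in the usage note are hypothetical; the flags are the ones described in this section.

```shell
# Hypothetical helper: print (not run) a single-copy core gene command.
# Arguments: genomes storage, pan database, number of genomes, output file.
print_scg_command() {
    genomes_storage="$1"
    pan_db="$2"
    num_genomes="$3"
    output="$4"
    # "Single-copy": at most 1 gene from each genome.
    # "Core": the cluster must occur in all num_genomes genomes.
    printf 'anvi-get-sequences-for-gene-clusters -g %s -p %s -o %s --max-num-genes-from-each-genome 1 --min-num-genomes-gene-cluster-occurs %s\n' \
        "$genomes_storage" "$pan_db" "$output" "$num_genomes"
}
```

For instance, print_scg_command MY-GENOMES.db MY-PAN.db 50 scg.fa prints the command for a 50-genome pangenome, to which further filters can be appended by hand.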

You can also exclude genomes that are missing more than a given number of the gene clusters you’re working with by using the parameter --max-num-gene-clusters-missing-from-genome.

For each of these parameters, see the program’s help menu for more information.

Fun with phylogenomics!

To get a concatenated-gene-alignment-fasta (which you can use to run anvi-gen-phylogenomic-tree), use the parameter --concatenate-gene-clusters:

anvi-get-sequences-for-gene-clusters -g genomes-storage-db \
                                     -p pan-db \
                                     -o concatenated-gene-alignment-fasta \
                                     --concatenate-gene-clusters

Here, you can also choose which aligner to use (or list the available aligners), and optionally provide a NEXUS-formatted partition file.
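Before passing a concatenated alignment to a tree-building step, it can be worth confirming that every sequence has the same length, since that is what an alignment implies. The sketch below is a hypothetical helper, not an anvi'o command. It assumes a plain FASTA file and tolerates sequences wrapped over several lines.

```shell
# Hypothetical helper (not an anvi'o command): succeed only if all sequences
# in a FASTA file have the same length, as expected for an alignment.
alignment_lengths_consistent() {
    awk '
        # On each header, record the length of the sequence just finished.
        /^>/ { if (seq != "") lens[length(seq)] = 1; seq = ""; next }
        # Join wrapped sequence lines into one string.
        { seq = seq $0 }
        END {
            if (seq != "") lens[length(seq)] = 1
            n = 0
            for (l in lens) n++
            # Exactly one distinct length means the alignment is consistent.
            exit (n == 1 ? 0 : 1)
        }
    ' "$1"
}
```

The function reports its result through its exit status, so it can be used directly in a condition such as: if alignment_lengths_consistent concatenated.fa; then echo "ok"; fi (with concatenated.fa as an illustrative file name).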
