The anvi'o 'ecophylo' workflow

Co-characterize the biogeography and phylogeny of any protein

The ecophylo workflow explores the ecological and phylogenetic relationships between individual genes and environments. Briefly, the workflow extracts a target gene from any set of FASTA files (e.g., isolate genomes, MAGs, SAGs, or simply assembled metagenomes) using a user-defined HMM, and offers integrated access to the phylogenetics of matching genes and their distribution across environments.

Artifacts accepted

The ecophylo workflow can typically be initiated with the following artifacts:

workflow-config samples-txt hmm-list external-genomes metagenomes

Artifacts produced

The ecophylo workflow typically produces the following anvi’o artifacts:

contigs-db profile-db

Third party programs

The ecophylo workflow may use the following programs, depending on the user settings in the workflow-config:

  • Bowtie2 (Read recruitment)
  • MMseqs2 (Cluster open reading frames)
  • muscle (Align protein sequences)
  • trimal (Trim multiple sequence alignment)
  • IQ-TREE (Calculate phylogenetic tree)
  • FastTree (Calculate phylogenetic tree)
  • HMMER (Search for homologous sequences)

An anvi’o installation that follows the recommendations on the installation page will include all these programs. Please review your settings and cite the additional tools you used in your methods section.

Workflow description and usage

The ecophylo workflow starts with a user-defined target gene (HMM) and a list of assembled genomes and/or metagenomes and results in an interactive interface that includes (1) a phylogenetic analysis of all genes found in genomes and metagenomes that match to the user-defined target gene, and (2) the distribution pattern of each of these genes across metagenomes if the user provided metagenomic short reads to survey.

The user-defined target genes can be described by an hmm-list. Furthermore, the assemblies of genomes and/or metagenomes to search these genes can be passed to the workflow via the artifacts external-genomes and metagenomes, respectively. Finally, the user can also provide a set of metagenomic short reads via the artifact samples-txt to recover the distribution patterns of genes.
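The input tables described above can be sketched as small tab-separated files. This is a minimal sketch, not an official template: the column names (`name`, `source`, `path` for the hmm-list; `name`, `contigs_db_path` for external-genomes; `sample`, `r1`, `r2` for samples-txt) are assumptions based on the corresponding anvi’o artifact descriptions and should be verified against them. The helper names and every file name, genome name, and path are hypothetical placeholders.

```python
import csv
from pathlib import Path


def write_tsv(path, header, rows):
    """Write a header line and data rows to a tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def write_example_inputs(outdir):
    """Create placeholder hmm-list, external-genomes, and samples-txt files.

    Column names are assumptions; check them against the artifact docs.
    Returns the sorted names of the files found in `outdir`.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    # hmm-list: one target gene per row. "INTERNAL" is assumed to point
    # to an HMM source that ships with anvi'o.
    write_tsv(outdir / "hmm_list.txt",
              ["name", "source", "path"],
              [["Ribosomal_L16", "Bacteria_71", "INTERNAL"]])

    # external-genomes: one contigs-db per genome (hypothetical paths).
    write_tsv(outdir / "external-genomes.txt",
              ["name", "contigs_db_path"],
              [["genome_A", "genomes/genome_A.db"],
               ["genome_B", "genomes/genome_B.db"]])

    # samples-txt: paired-end reads per metagenome (hypothetical paths).
    write_tsv(outdir / "samples.txt",
              ["sample", "r1", "r2"],
              [["sample_01", "reads/s01_R1.fastq.gz", "reads/s01_R2.fastq.gz"]])

    return sorted(p.name for p in outdir.iterdir())
```

Calling `write_example_inputs("inputs")` would create the three files in a new `inputs` directory; a metagenomes file could be written the same way if assembled metagenomes are part of the search.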

In a standard run, ecophylo first identifies matching genes based on their HMMs, then clusters them based on sequence similarity at a user-defined threshold, and finally selects a representative sequence from each cluster that contains more than two genes. Next, ecophylo calculates a phylogenetic tree from these representative sequences to infer their evolutionary relationships, producing a NEWICK-formatted dendrogram. If the user provided a samples-txt for metagenomic read recruitment, the workflow also performs a read recruitment step to recover and store coverage statistics of the final set of genes in a profile-db for downstream analyses. Upon completion, the workflow yields all files necessary to explore the results through an anvi’o interactive interface and to investigate associations between the ecological and evolutionary relationships of target genes. The workflow can use any HMM that models amino acid sequences. Using single-copy core genes such as ribosomal proteins will yield de facto taxonomic profiles of metagenomes.
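The representative-selection step can be illustrated with a toy sketch. This is not how the workflow is implemented: the clustering itself is done by an external tool, the function name `pick_representatives` is hypothetical, and taking the first member of each cluster as its representative is a placeholder rule chosen only for illustration. The sketch shows just the size filter described above, where clusters with two or fewer genes are dropped.

```python
def pick_representatives(clusters, min_size=3):
    """Keep one representative for each sufficiently large cluster.

    clusters: dict mapping a cluster id to a list of gene ids.
    min_size: smallest cluster size to keep; 3 means "more than two genes".

    The first member stands in for the representative (placeholder rule).
    """
    return {
        cluster_id: members[0]
        for cluster_id, members in clusters.items()
        if len(members) >= min_size
    }


# Toy input: only cluster "c1" has more than two genes.
example_clusters = {
    "c1": ["gene_1", "gene_2", "gene_3"],
    "c2": ["gene_4", "gene_5"],
    "c3": ["gene_6"],
}
representatives = pick_representatives(example_clusters)
```

With this toy input, only `c1` passes the filter, so `representatives` holds a single entry.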

The ecophylo workflow has two modes, tree-mode and profile-mode, which are selected in the workflow-config by changing the input files that are provided. In tree-mode, the sequences are used to calculate a phylogenetic tree. In profile-mode, the sequences are used to calculate a phylogenetic tree and are additionally profiled via read recruitment across user-provided metagenomes.

Required input

The ecophylo workflow requires the following files:

  • workflow-config: This allows you to customize the workflow step by step. Here is how you can generate the default version:

anvi-run-workflow -w ecophylo \
                  --get-default-config config.json

Here is a tutorial walking through more details regarding the ecophylo workflow-config file: coming soon!
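Once a default config exists, its input fields can be filled in by hand or with a small script. The sketch below is one way to do this and is not part of anvi’o: the function name `set_ecophylo_inputs` is hypothetical, and the four keys it touches (`hmm_list`, `external_genomes`, `metagenomes`, `samples_txt`) are the ones shown in the config examples on this page. All other entries in the file are left untouched.

```python
import json


def set_ecophylo_inputs(config_path, hmm_list, external_genomes="",
                        metagenomes="", samples_txt=""):
    """Update the input-file entries of an existing workflow-config.

    Only the four input keys are changed; every other setting in the
    JSON file is preserved. Returns the updated config as a dict.
    """
    with open(config_path) as handle:
        config = json.load(handle)

    config.update({
        "hmm_list": hmm_list,
        "external_genomes": external_genomes,
        "metagenomes": metagenomes,
        # An empty string leaves read recruitment switched off.
        "samples_txt": samples_txt,
    })

    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=4)
        handle.write("\n")

    return config
```

For example, `set_ecophylo_inputs("config.json", "hmm_list.txt", external_genomes="external-genomes.txt")` would prepare a config without read recruitment, which can then be passed to `anvi-run-workflow -w ecophylo -c config.json`.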

tree-mode: Insights into the evolutionary patterns of target genes

This is the simplest implementation of ecophylo, where only an amino acid-based phylogenetic tree is calculated. The workflow extracts the target gene from the input assemblies, clusters the sequences and picks representatives, then calculates a phylogenetic tree from the representative amino acid sequences. tree-mode has two sub-modes that differ in how representative sequences are picked: NT-mode and AA-mode, in which the nucleotide (NT) or amino acid (AA) versions of the extracted genes, respectively, are used to cluster the dataset and pick representatives.

NT-mode

Cluster and select representative genes based on NT sequences.

This is the default version of tree-mode, where the extracted gene sequences are clustered based on their associated NT sequences. This is done to prepare for profile-mode, where adequate sequence distance between gene NT sequences is needed to prevent non-specific read recruitment. The translated amino acid versions of the NT cluster representatives are then used to calculate an AA-based phylogenetic tree. This mode is especially useful for checking what the gene phylogenetic tree will look like before the read recruitment step in profile-mode (for gene phylogenetic applications of ecophylo, please see AA-mode). If everything looks good, you can add your samples-txt and continue with profile-mode to add metagenomic read recruitment results.

Here is what the start of the ecophylo workflow-config should look like if you want to run tree-mode:

{
    "metagenomes": "metagenomes.txt",
    "external_genomes": "external-genomes.txt",
    "hmm_list": "hmm_list.txt",
    "samples_txt": ""
}

AA-mode

Cluster and select representative genes based on AA sequences. If you are interested specifically in gene phylogenetics, this is the mode for you!

This is another sub-version of tree-mode where representative sequences are chosen via AA sequence clustering.

To initialize AA-mode, go to the rule cluster_X_percent_sim_mmseqs in the ecophylo workflow-config and set "AA_mode" to true:

{
    "metagenomes": "metagenomes.txt",
    "external_genomes": "external-genomes.txt",
    "hmm_list": "hmm_list.txt",
    "samples_txt": "",
    "cluster_X_percent_sim_mmseqs": {
        "AA_mode": true
    }
}

Be sure to set the --min-seq-id of the cluster_X_percent_sim_mmseqs rule to a clustering threshold that is appropriate for the mode you are running (NT-mode or AA-mode).
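Because the sub-mode and the clustering threshold need to change together, it can help to set them in one step. The following is a minimal sketch, assuming the threshold is stored under a `"--min-seq-id"` key inside the `cluster_X_percent_sim_mmseqs` rule of the config; the function name `set_clustering` is hypothetical, and the value `0.9` in the example is a placeholder, not a recommendation.

```python
import copy

RULE = "cluster_X_percent_sim_mmseqs"


def set_clustering(config, min_seq_id, aa_mode=False):
    """Return a copy of `config` with the clustering rule updated.

    min_seq_id: sequence identity threshold as a fraction in [0, 1].
    aa_mode: True to cluster on AA sequences, False for NT sequences.
    """
    if not 0.0 <= min_seq_id <= 1.0:
        raise ValueError("min_seq_id must be a fraction between 0 and 1")

    updated = copy.deepcopy(config)  # leave the caller's dict unchanged
    rule = updated.setdefault(RULE, {})
    rule["--min-seq-id"] = min_seq_id  # assumed key name inside the rule
    rule["AA_mode"] = aa_mode
    return updated


# Placeholder threshold, for illustration only.
aa_config = set_clustering({"hmm_list": "hmm_list.txt"}, 0.9, aa_mode=True)
```

Returning a copy keeps the original config available, which makes it easy to prepare an NT-mode and an AA-mode config from the same starting point.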

profile-mode: Insights into the ecological and evolutionary patterns of target genes and environments

profile-mode is an extension of the default tree-mode (NT-mode) in which the NT sequence representatives are profiled with metagenomic reads from user-provided metagenomic samples. This allows for the simultaneous visualization of the phylogenetic and ecological relationships of genes across metagenomic datasets.

Additional required files:

  • samples-txt

To initialize profile-mode, add the path to your samples-txt to your ecophylo workflow-config:

{
    "metagenomes": "metagenomes.txt",
    "external_genomes": "external-genomes.txt",
    "hmm_list": "hmm_list.txt",
    "samples_txt": "samples.txt"
}
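The relationship between the config entries and the modes described on this page can be summarized in a few lines. This is an illustrative helper, not anvi’o code: the function name `describe_mode` is hypothetical, and it relies only on the `samples_txt` and `AA_mode` keys shown in the examples above. How the workflow treats a config that sets both a samples-txt and `AA_mode` is not covered here, so the sketch simply reports profile-mode in that case.

```python
def describe_mode(config):
    """Report which ecophylo mode a config dict would select.

    A non-empty "samples_txt" means profile-mode; otherwise the run is
    tree-mode, with the sub-mode taken from the "AA_mode" flag of the
    cluster_X_percent_sim_mmseqs rule (NT-mode when the flag is absent).
    """
    samples = (config.get("samples_txt") or "").strip()
    rule = config.get("cluster_X_percent_sim_mmseqs") or {}
    aa_mode = bool(rule.get("AA_mode", False))

    if samples:
        return "profile-mode"
    return "tree-mode (AA-mode)" if aa_mode else "tree-mode (NT-mode)"


# The three example configs from this page, reduced to the relevant keys.
nt_example = {"samples_txt": ""}
aa_example = {"samples_txt": "",
              "cluster_X_percent_sim_mmseqs": {"AA_mode": True}}
profile_example = {"samples_txt": "samples.txt"}
```

Running `describe_mode` on a config before starting the workflow is a quick way to confirm that the intended mode will be used.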

Config file options

Ecophylo will sanity check all input files that contain contigs-dbs before the workflow starts. This can take a while, especially if you are working with thousands of genomes. If you want to skip the sanity checks for the contigs-dbs in your external-genomes and/or metagenomes, adjust your config as follows:

{
    "run_genomes_sanity_check": false
}
