The anvi'o 'phylogenomics' workflow

Authors
Artifacts accepted
Artifacts produced
Third party programs
Workflow description and usage
Required input
Run it
Output structure
Notes

Infer a phylogeny from homologous protein sequences

This workflow is for users who want to infer a phylogeny from homologous protein sequences extracted from anvi'o genome inputs. It uses the contigs workflow to prepare the underlying contigs databases, retrieves single-copy core genes or other HMM hits from those genomes, aligns the sequences, trims the alignment, and generates a phylogenetic tree that can be used downstream as a phylogeny artifact.

Authors

Kathryn Kananen

Artifacts accepted

The phylogenomics workflow can typically be initiated with the following artifacts:

workflow-config, internal-genomes, external-genomes

Artifacts produced

The phylogenomics workflow typically produces the following anvi’o artifacts:

phylogeny

Third party programs

The phylogenomics workflow may use the following programs, depending on the user settings in the workflow-config:

trimal (Trim multiple sequence alignment)
IQ-TREE (Calculate phylogenetic tree)

An anvi’o installation that follows the recommendations on the installation page will include all of these programs. Please review your settings and cite the additional tools you used in your methods section.

Workflow description and usage

The phylogenomics workflow starts with one or more contigs-db files and extracts a set of genes defined by HMMs. It then aligns and concatenates the recovered sequences, trims the resulting alignment, and infers a phylogenomic tree.

The workflow is meant for cases where you want to build a tree from homologous genes already annotated in anvi’o contigs databases. It can use internal genomes, external genomes, or a mix of both, as long as they are provided through the workflow config.

Required input

The phylogenomics workflow requires a workflow-config file. You can generate a default config like this:

anvi-run-workflow -w phylogenomics \
                  --get-default-config config.json

The workflow config typically includes:

project_name, which is used as the prefix for workflow outputs.
internal_genomes and/or external_genomes, which point to the genomes or metagenomes that should be used.
Parameters for anvi_get_sequences_for_hmm_hits, trimal, and iqtree.
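
The internal_genomes and external_genomes entries point to tab-delimited text files. As a minimal sketch, the Python snippet below writes an external-genomes file with the two columns that artifact uses, name and contigs_db_path. The genome names and database paths are hypothetical placeholders, write_external_genomes is a helper made up for this example, and the file goes to a temporary directory so nothing in your project is overwritten.

```python
import csv
import tempfile
from pathlib import Path


def write_external_genomes(rows, path):
    """Write a tab-delimited external-genomes file with a header line."""
    path = Path(path)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["name", "contigs_db_path"])
        writer.writerows(rows)
    return path


# Hypothetical genome names and contigs-db paths, for illustration only.
genomes = [
    ("Genome_A", "databases/Genome_A-contigs.db"),
    ("Genome_B", "databases/Genome_B-contigs.db"),
]

out_path = write_external_genomes(
    genomes, Path(tempfile.mkdtemp()) / "external-genomes.txt"
)
print(out_path.read_text())
```

An internal-genomes file follows the same tab-delimited idea but carries more columns; see the artifact's own documentation for its layout.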

An example minimal config looks like this:

{
    "workflow_name": "phylogenomics",
    "config_version": "3",
    "project_name": "phylo_project",
    "internal_genomes": "internal-genomes.txt",
    "external_genomes": "external-genomes.txt",
    "anvi_get_sequences_for_hmm_hits": {
        "--return-best-hit": true,
        "--align-with": "famsa",
        "--concatenate-genes": true,
        "--get-aa-sequences": true,
        "--hmm-sources": "Bacteria_71"
    },
    "trimal": {
        "-gt": 0.5
    },
    "iqtree": {
        "threads": 8,
        "-m": "WAG",
        "-bb": 1000
    },
    "output_dirs": {
        "PHYLO_DIR": "01_PHYLOGENOMICS",
        "LOGS_DIR": "00_LOGS"
    }
}

The project_name is mandatory. The workflow uses it to name the output FASTA, alignment, and tree files.
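
Before launching a run, it can help to sanity-check the config against the requirements stated here. The sketch below is not anvi’o's own validation: check_config is a hypothetical helper that only tests for a project_name and for at least one genomes file entry.

```python
import json


def check_config(config):
    """Return a list of problems found in a phylogenomics config dict."""
    problems = []
    if not config.get("project_name"):
        problems.append("project_name is missing or empty")
    if not (config.get("internal_genomes") or config.get("external_genomes")):
        problems.append("neither internal_genomes nor external_genomes is set")
    return problems


# A trimmed-down config, parsed from JSON the same way config.json would be.
config = json.loads("""
{
    "workflow_name": "phylogenomics",
    "project_name": "phylo_project",
    "external_genomes": "external-genomes.txt"
}
""")

for problem in check_config(config) or ["no problems found"]:
    print(problem)
```

To check a real file, replace the inline JSON with json.load on an open handle to your config.json.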

Run it

Create a workflow graph first if you want to inspect the plan:

anvi-run-workflow -w phylogenomics \
                  -c config.json \
                  --save-workflow-graph

Then run the workflow:

anvi-run-workflow -w phylogenomics \
                  -c config.json

If everything completes successfully, you should end up with a concatenated amino acid FASTA, a trimmed alignment, and a final tree in the phylogenomics output directory.
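
If you prefer to launch the run from a script, the two invocations above can be wrapped as follows. This sketch uses only the flags shown in this section; build_command and run_workflow are hypothetical helper names, and the workflow is started only if you call run_workflow yourself.

```python
import shutil
import subprocess


def build_command(config_path, save_graph=False):
    """Assemble the anvi-run-workflow argument list for this workflow."""
    command = ["anvi-run-workflow", "-w", "phylogenomics", "-c", config_path]
    if save_graph:
        command.append("--save-workflow-graph")
    return command


def run_workflow(config_path, save_graph=False):
    """Run the workflow, failing early if anvi'o is not on the PATH."""
    if shutil.which("anvi-run-workflow") is None:
        raise RuntimeError("anvi-run-workflow was not found on the PATH")
    subprocess.run(build_command(config_path, save_graph), check=True)


# Print the commands without executing them.
print(" ".join(build_command("config.json", save_graph=True)))
print(" ".join(build_command("config.json")))
```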

Output structure

The workflow writes its main outputs under 01_PHYLOGENOMICS/ by default.

Typical files include the following, where PROJECT stands for the project_name set in the config:

01_PHYLOGENOMICS/
├── PROJECT-proteins.fa
├── PROJECT-proteins_GAPS_REMOVED.fa
└── PROJECT-proteins_GAPS_REMOVED.fa.contree
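
A quick way to confirm that a run finished is to check for these files. The sketch below assumes that PROJECT in the listing stands for the project_name from your config and that the default output directory is in use; expected_outputs and missing_outputs are hypothetical helpers.

```python
from pathlib import Path


def expected_outputs(project_name, phylo_dir="01_PHYLOGENOMICS"):
    """List the output paths shown in the listing for one project."""
    base = Path(phylo_dir)
    return [
        base / f"{project_name}-proteins.fa",
        base / f"{project_name}-proteins_GAPS_REMOVED.fa",
        base / f"{project_name}-proteins_GAPS_REMOVED.fa.contree",
    ]


def missing_outputs(project_name, phylo_dir="01_PHYLOGENOMICS"):
    """Return the expected files that do not exist yet."""
    return [p for p in expected_outputs(project_name, phylo_dir) if not p.is_file()]


# Report anything absent for the example project name used in this document.
for path in missing_outputs("phylo_project"):
    print(f"missing: {path}")
```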

The intermediate files represent the main stages:

anvi-get-sequences-for-hmm-hits extracts the target proteins.
trimal trims the multiple sequence alignment.
iqtree infers the phylogenomic tree.
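
Each stage is driven by its own block in the config. The sketch below shows one plausible reading of those blocks as command-line arguments: keys that start with a dash are treated as flags, true becomes a bare flag, and any other value follows its flag. This mapping is an assumption made for illustration, not a description of how anvi’o assembles its commands, and flags_to_args is a hypothetical helper.

```python
def flags_to_args(params):
    """Turn a config block into a flat list of command-line arguments."""
    args = []
    for key, value in params.items():
        if not key.startswith("-"):
            # Keys such as "threads" are treated here as workflow settings,
            # not as flags of the underlying tool (an assumption).
            continue
        if value is True:
            args.append(key)  # boolean switch, e.g. --concatenate-genes
        elif value is False or value is None or value == "":
            continue  # unset or disabled option
        else:
            args.extend([key, str(value)])
    return args


# Parameter blocks copied from the example config in this document.
stages = {
    "anvi-get-sequences-for-hmm-hits": {
        "--return-best-hit": True,
        "--align-with": "famsa",
        "--concatenate-genes": True,
        "--get-aa-sequences": True,
        "--hmm-sources": "Bacteria_71",
    },
    "trimal": {"-gt": 0.5},
    "iqtree": {"threads": 8, "-m": "WAG", "-bb": 1000},
}

for program, params in stages.items():
    print(program, " ".join(flags_to_args(params)))
```

Input and output file arguments are left out on purpose, since the workflow fills those in from project_name and the output directories.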

Workflow logs are written under 00_LOGS/phylogenomics by default and are organized by rule name. The workflow also writes a tab-delimited manifest, 00_LOGS/phylogenomics/phylogenomics-workflow-manifest.tsv, that records whether each job succeeded or failed and points to the relevant rule log.

Notes

This workflow inherits from the contigs workflow, so the same contigs database setup and log organization conventions apply. If you are building your phylogeny from anvi’o HMM hits, --return-best-hit and --concatenate-genes are usually the first settings to review.
