Co-characterize the biogeography and phylogeny of any protein
The ecophylo workflow explores the ecological and phylogenetic relationships between a gene family and the environment. Briefly, the workflow extracts a target gene from any set of FASTA files (e.g., isolate genomes, MAGs, SAGs, or simply assembled metagenomes) using a user-defined HMM, and offers an integrated access to the phylogenetics of matching genes, and their distribution across environments.
π To the main page of anviβo programs and artifacts.
The ecophylo can typically be initiated with the following artifacts:
workflow-config samples-txt hmm-list external-genomes metagenomes
The ecophylo typically produce the following anviβo artifacts:
This is a list of programs that may be used by the ecophylo workflow depending on the user settings in the workflow-config :
An anviβo installation that follows the recommendations on the installation page will include all these programs. But please consider your settings, and cite these additional tools from your methods sections.
The ecophylo workflow starts with a user-provided target gene family defined by an HMM and a list of assembled genomes and/or metagenomes. The final output is an interactive interface that includes (1) a phylogenetic analysis of all genes detected by the HMM in genomes and/or metagenomes, and (2) the distribution pattern of each of these genes across metagenomes if the user provided metagenomic short reads to survey.
The βuser-provided HMMβ is passed to ecophylo via the hmm-list file, and the input assemblies of genomes and/or metagenomes to query using the HMM are passed to the workflow via the files external-genomes and metagenomes, respectively. Finally, the user can also provide a set of metagenomic short reads via a samples-txt to recover the distribution patterns of genes across samples.
Ecophylo first identifies homologous genes based on the input HMM, clusters matching sequences based on a user-defined sequence similarity threshold, and finally selects a representative sequence from each cluster that contains more than two genes. The final set of representative genes are filtered for QC at multiple steps of the workflow which is discussed later in this document in the section βQuality control and processing of hmm-hitsβ. After this step, the ecophylo workflow can continue with one of two modes that the user defines in the workflow-config: The so-called tree-mode or the so-called profile-mode.
In the tree-mode, the user must provide an hmm-list and metagenomes and/or external-genomes, and the workflow will stop after extracting representative sequences and calculating a phylogenetic tree (without any insights into the ecology of sequences through a subsequent step of metagenomic read recruitment). In contrast, the profile-mode will require an additional file: samples-txt. In this mode the workflow will continue with the profiling of representative sequences via read recruitment across user-provided metagenomes to recover and store coverage statistics. The completion of the workflow will yield all files necessary to explore the results in downstream analyses to investigate associations between ecological and evolutionary relationships between target genes.
The ecophylo workflow can leverage any HMM that models amino acid sequences. If the user chooses an HMM for a single-copy core gene, such as ribosomal protein, the workflow will yield multi-domain taxonomic profiles of metagenomes de facto.
If you have never run an anviβo snakemake workflow, please checkout the anviβo snakemake workflow tutorial. This is where you can learn the basics about how anviβo leverages Snakemake to process data. In fact, the EcoPhylo workflow uses the anviβo metagenomics workflow to profile protein families across metagenomes. Two birds, two workflows?
The minimum requirements of the ecophylo workflow are the following:
anvi-run-workflow -w ecophylo \ --get-default-config config.json
The ecophylo workflow produces a ton of intermediate files that can be useful for you to explore your data! Here is a basic look at the directory structure after successfully running the workflow:
$ tree ECOPHYLO_WORKFLOW -L 1
βββ 00_LOGS
βββ 01_REFERENCE_PROTEIN_DATA
βββ 02_NR_FASTAS
βββ 03_MSA
βββ 04_SEQUENCE_STATS
βββ 05_TREES
βββ 06_MISC_DATA
βββ METAGENOMICS_WORKFLOW
Letβs dive into some key intermediate files!
01_REFERENCE_PROTEIN_DATA/
This directory contains data extracted from each individual contigs-db the user provides via the external-genomes and/or metagenomes files. (The wide part of the funnel)
ASSEMBLY-PROTEIN-external_gene_calls.tsv
: external-gene-calls for each open reading frame in the analyzed contigs-dbASSEMBLY-PROTEIN-external_gene_calls_renamed.tsv
: external-gene-calls renamed and subsetted for target proteinASSEMBLY-PROTEIN-hmm_hits.faa
and ASSEMBLY-PROTEIN-hmm_hits.fna
: target protein sequences extracted with anvi-get-sequences-for-hmm-hitsASSEMBLY-PROTEIN-hmm_hits_renamed.faa
and ASSEMBLY-PROTEIN-hmm_hits_renamed.fna
: fasta files with renamed headersASSEMBLY-PROTEIN-orfs.fna
: output fasta from anvi-get-sequences-for-gene-callsASSEMBLY-PROTEIN-reformat_report_AA.txt
and ASSEMBLY-PROTEIN-reformat_report_nt.txt
ASSEMBLY-dom-hmmsearch/
: this directory contains all the homologs extracted from your input contigs-db (genomes and/or metagenomic assembles) with the user-provided HMM. Here are some key files:
hmm.domtable
contains the raw output the domain hits table from hmmsearch
run by anvi-run-hmmshmm_hits.txt
is the hmm-hits table stored in the associated contigs-dbhmm_hits_filtered.txt
is the filtered version hmm_hits.txt
based on user-defined parameters02_NR_FASTAS/
This directory is where all the data from individual contigs-db is combined and contains all the protein family clustering information from the workflow. (The narrow part of the funnel)
PROTEIN-all.faa
and PROTEIN-all.fna
contains ALL the amino acid and nucleotide sequences that made it past the initial filtering steps and will be clusteredPROTEIN-mmseqs_NR_cluster.tsv
: mmseqs cluster output file. VERY helpful for looking inside clusters (column 1: representative sequence; column 2: cluster member)PROTEIN-references_for_mapping_NT.fa
: Input ORFs used for the metagenomics workflowPROTEIN-AA_subset.fa
: translated sequences from PROTEIN-references_for_mapping_NT.fa
PROTEIN-external_gene_calls_all.tsv
: external-gene calls file for PROTEIN-references_for_mapping_NT.fa
03_MSA/
This directory contains all the intermediate files from multiple sequences alignment steps.
PROTEIN-aligned.fa
: raw output from MSAPROTEIN_aligned_trimmed.fa
: trimmed MSA from trimal
PROTEIN_aligned_trimmed_filtered.fa
: subsetted MSA removing sequences with x > 50% gapsPROTEIN_gaps_counts.tsv
: Table counting number of gaps per sequence in MSA04_SEQUENCE_STATS/
This directory contains information regarding the number of sequences filtered at various steps of the workflow.
PROTEIN_stats.tsv
: tracks number of sequences filtered at different steps of the workflow. Here are definitions for each rule:
combine_sequence_data
: total number of sequences before clusteringcluster_X_percent_sim_mmseqs
: number of cluster representative sequencesremove_sequences_with_X_percent_gaps
: number of sequences left after filtering during MSA99_percent
: number of clusters after clustering at 99% nucleotide similarity98_percent
: number of clusters after clustering at 98% nucleotide similarity05_TREES/
Here we have all things phylogenetics in the workflow.
PROTEIN-PROFILE.db
:PROTEIN.nwk
: PROTEIN phylogenetic treePROTEIN_renamed.faa
:renamed PROTEIN phylogenetic tree fasta filePROTEIN_renamed.nwk
: renamed PROTEIN phylogenetic tree to be imported in Metagenomics workflow merged profilePROTEIN_renamed_all.faa
: renamed fasta file with ALL proteins06_MISC_DATA/
This directory contains miscellaneous data created from the flow to help you interpret the phylogeography of your target PROTEIN.
PROTEIN_misc.tsv
: This file contains basic information about each representative sequence in the workflow including:
PROTEIN_scg_taxonomy_data.tsv
and PROTEIN_estimate_scg_taxonomy_results-RAW-LONG-FORMAT.txt
are the output of anvi-estimate-scg-taxonomy and will only be there if you are explore the phylogeography of a compatible single-copy core gene
METAGENOMICS_WORKFLOW/
This directory contains the output of the EcoPhylo sequences profiled with metagenomes with the anviβo metagenomics workflow.
Hidden Markov Models are the crux of the ecophylo workflow and will determine the sensitivity and specificity of the gene family hmm-hits you seek to investigate. However, not all hmm-hits are created equal. Just how BLAST can detect spurious hits with high-scoring segment pairs, an HMM search can yield non-homologous hits as well. To address this, we have a series of parameters you can adjust in the workflow-config to fine tune the input set of hmm-hits that ecophylo will process.
The first step to removing bad hmm-hits is to filter out hits with low quality alignment coverage. This is done with the rule filter_hmm_hits_by_model_coverage
which leverages anvi-script-filter-hmm-hits-table. This tool uses the output of hmmsearch to filter out hits basedon the model and/or gene coverage. We recommend 80% model coverage filter for most cases. However, it is always recommended to explore the distribution of model coverage with any new HMM which will help you determine a proper cutoff (citation). To adjust this parameter, go to the filter_hmm_hits_by_model_coverage
rule and change the parameter --min-model-coverage
. You can also adjust the gene coverage by change the parameter --min-gene-coverage
. This can help remove ORFs with outlier lengths but completely depends on the HMM you are using.
Please consider exploring the distribution of alignment coverages before choosing HMM alignment coverage filtering values. Interproscan is a great way to visualize how publicly available HMMs align to proteins. Additionally, you can parse the domtblout files from hmmsearch to explore these values in high throughput.
{
"filter_hmm_hits_by_model_coverage": {
"threads": 5,
"--min-model-coverage": 0.8,
"--min-gene-coverage": 0.5,
"additional_params": ""
},
}
Some full gene length HMM models align to a single hmm-hit independently at different coordinates when there should only be one annotation. To merge these independent alignment into one HMM alignment coverage stat, set --merge-partial-hits-within-X-nts
to any distance between the hits for which you would like to merge and add it to the rule filter_hmm_hits_by_model_coverage
under additional_params
.
{
"filter_hmm_hits_by_model_coverage": {
"additional_params": "--merge-partial-hits-within-X-nts"
},
}
Genes predicted from genomes and metagenomes can be partial or complete depending on whether a stop and stop codon is detected. Even if you filter out hmm-hits with bad alignment coverage as discussed above, HMMs can still detect low quality hits with good alignment coverage and homology statistics due to partial genes. Unfortunately, partial genes can lead to spurious phylogenetic branches and/or inflate the number of observed populations or functions in a given set of genomes/metagenomes.
To remove partial genes from the ecophylo analysis, the user can assign true
for --filter-out-partial-gene-calls
parameter so that only complete open-reading frames are processed.
Below is the default settings in the ecophylo workflow-config file.
{
"filter_hmm_hits_by_model_coverage": {
"threads": 5,
"--min-model-coverage": 0.8,
"--filter-out-partial-gene-calls": true,
"additional_params": ""
},
}
However, maybe youβre a risk taker, a maverick explorer of metagenomes. Complete or partial you accept all genes and their potential tree bending shortcomings! In this case, set --filter-out-partial-gene-calls false
in the workflow-config.
Simultaneously exploring complete and partial ORFs will increase the distribution of sequence lengths and thus impact sequence clustering. By default, we used mmseqs "--cov-mode": 1
in the rule cluster_X_percent_sim_mmseqs
to help insure ORFs of all lengths properly cluster together. Please refer to the MMseqs2 user guide description of βcov-mode for more details and adjust the parameter to suite your scientific endeavors.
{
"filter_hmm_hits_by_model_coverage": {
"threads": 5,
"--min-model-coverage": 0.8,
"--filter-out-partial-gene-calls": false,
"additional_params": ""
},
"cluster_X_percent_sim_mmseqs": {
"threads": 5,
"--min-seq-id": 0.94,
"--cov-mode": 1,
"clustering_threshold_for_OTUs": [
0.99,
0.98,
0.97
],
"AA_mode": false
},
}
Now that you have fine tuned the gene family input into the ecophylo workflow, itβs time to decide what output best fits your science question at hand.
Itβs common that not all genomes or metagenomes will have the gene family of interest either due to it not being detect by the input HMM or filtered out during the QC steps. Please check this log file for contigs-db that did not contain your gene family of interest: 00_LOGS/contigDBs_with_no_hmm_hit_*.log
One step of ecophylo is to perform a multiple sequence alignment of the recruited homologs and depending on your application, this could recruit thousands of ORFs which make the MSA a challenging feat. By default, the ecophylo is designed for quick insights, and thus the workflow-config file uses MUSCLE parameters to perform a large MSA, swiftly:
"align_sequences": {
"threads": 5,
"additional_params": "-maxiters 1 -diags -sv -distance1 kbit20_3"
},
However, these parameters may not be optimal for your use case. For example, maybe you are trying to explore branches patterns of a specific protein family and would prefer to have mulitple interations of the MSA. Please explore the MUSCLE documentation to documentation customize the MSA step for your needs. You can replace the additional_params
with whatever MUSCLE parameters that are best for you.
This is the simplest implementation of ecophylo where only an amino acid based phylogenetic tree is calculated. The workflow will extract the target gene from input assemblies, cluster and pick representatives, then calculate a phylogenetic tree based on the amino acid representative sequences. There are two sub-modes of tree-mode which depend on how you pick representative sequences, NT-mode or AA-mode where extracted genes associated nucleotide version (NT) or the amino acid (AA) can be used to cluster the dataset and pick representatives, respectively.
Cluster and select representative genes based on NT sequences.
This is the default version of tree-mode where the extracted gene sequences are clustered based on their associated NT sequences. This is done to prepare for profile-mode, where adequate sequence distance is needed between gene NT sequences to prevent non-specific-read-recruitment. The translated amino acid versions of the NT sequence clusters are then used to calculate an AA based phylogenetic tree. This mode is specifically useful to see what the gene phylogenetic tree will look like before the read recruitment step in profile-mode, (for gene phylogenetic applications of ecophylo please see AA-mode). If everything looks good you can add in your samples-txt and continue with profile-mode to add metagenomic read recruitment results.
Here is what the start of the ecophylo workflow-config should look like if you want to run tree-mode:
{
"metagenomes": "metagenomes.txt",
"external_genomes": "external-genomes.txt",
"hmm_list": "hmm_list.txt",
"samples_txt": ""
}
Cluster and select representative genes based on AA sequences. If you are interested specifically in gene phylogenetics, this is the mode for you!
This is another sub-version of tree-mode where representative sequences are chosen via AA sequence clustering.
To initialize AA-mode, go to the rule cluster_X_percent_sim_mmseqs
in the ecophylo workflow-config and turn βAA_modeβ to true:
{
"metagenomes": "metagenomes.txt",
"external_genomes": "external-genomes.txt",
"hmm_list": "hmm_list.txt",
"samples_txt": ""
"cluster_X_percent_sim_mmseqs": {
"AA_mode": true,
}
}
Be sure to change the --min-seq-id
of the cluster_X_percent_sim_mmseqs
rule to the appropriate clustering threshold depending if you are in NT-mode or AA-mode.
PROTEIN="" # Replace with name of protein from hmm_list.txt
anvi-interactive -t 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.nwk \
-p 05_TREES/"${PROTEIN}"/"${PROTEIN}"-PROFILE.db \
--fasta 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.faa \
--manual
profile-mode, is an extension of default tree-mode (NT-mode) where NT sequences representatives are profiled with metagenomic reads from user provided metagenomic samples. This allows for the simultaneous visualization of phylogenetic and ecological relationships of genes across metagenomic datasets.
Additional required files:
To initialize profile-mode, , add the path to your samples-txt to your ecophylo workflow-config:
{
"metagenomes": "metagenomes.txt",
"external_genomes": "external-genomes.txt",
"hmm_list": "hmm_list.txt",
"samples_txt": "samples.txt"
}
To visualize the output of profile-mode, run anvi-interactive on the contigs-db and profile-db located in the METAGENOMICS_WORKFLOW
directory.
PROTEIN="" # Replace with name of protein from hmm_list.txt
anvi-interactive -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db \
-c METAGENOMICS_WORKFLOW/03_CONTIGS/"${PROTEIN}"-contigs.db \
--manual
Just want a quick look at the tree without read recruitment results?
PROTEIN="" # Replace with name of protein from hmm_list.txt
anvi-interactive -t 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.nwk \
-p 05_TREES/"${PROTEIN}"/"${PROTEIN}"-PROFILE.db \
--fasta 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.faa \
--manual
Calculating a phylogeny from ORFs recruited from a metagenomic assembly can result in some unnatural long branches. This can be caused by a variety of reasons including a misassembled sequences that ecophylo couldnβt removed automatically :(
To remove branches from the phylogenetic tree in the ecophylo interface, you can manually curate the tree and reimport it into anviβo. At the moment, anviβo does not have an automated program to do this but here is a workflow:
This is just a code outline. Please adjust parameters for the various steps to match your specific needs.
Step 1. Make collection of bad branches
Make a collection of branches you would like to remove and save it!
Step 2. Export collection and remove those sequences from the protein fasta file
HOME_DIR="ECOPHYLO"
PROTEIN="" # Replace with name of protein from hmm_list.txt
cd $HOME_DIR
mkdir SUBSET_TREE
# Export default collection and collection of bad branches
anvi-export-collection -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db -C DEFAULT --output-file-prefix SUBSET_TREE/DEFAULT
anvi-export-collection -C bad_branches -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db --output-file-prefix SUBSET_TREE/bad_branches
# Clean fasta headers
cut -f 1 SUBSET_TREE/bad_branches | sed 's|_split_00001||' > SUBSET_TREE/bad_branches_headers.txt
anvi-script-reformat-fasta 02_NR_FASTAS/"${PROTEIN}"/"${PROTEIN}"-AA_subset.fa --exclude-ids bad_branches_headers.txt \
-o SUBSET_TREE/"${PROTEIN}"-AA_subset_remove_bad_branches.fa
ALIGNMENT_PREFIX=""${PROTEIN}"-AA_subset_remove_bad_branches"
# Align
clusterize "muscle -in SUBSET_TREE/"${PROTEIN}"-AA_subset_remove_bad_branches.fa -out SUBSET_TREE/"${ALIGNMENT_PREFIX}".faa -maxiters 1" -n 15 -o 00_LOGS/align.log
# Trim
clusterize "trimal -in SUBSET_TREE/"${ALIGNMENT_PREFIX}".faa -out SUBSET_TREE/"${ALIGNMENT_PREFIX}"_trimmed.faa -gappyout" -n 5 -o 00_LOGS/trim.log
# Calculate tree
clusterize "FastTree SUBSET_TREE/"${ALIGNMENT_PREFIX}"_trimmed_filtered.faa > SUBSET_TREE/"${ALIGNMENT_PREFIX}"_trimmed_filtered_FastTree.nwk" -n 15 -o 00_LOGS/FastTree.log
Step 3. Use anvi-split
to remove bad branches from the ecophylo interface
PROTEIN="" # Replace with name of protein from hmm_list.txt
grep -v -f SUBSET_TREE/bad_branches_headers.txt SUBSET_TREE/collection-DEFAULT.txt | sed 's|EVERYTHING|EVERYTHING_curated|' > SUBSET_TREE/my_bins.txt
anvi-import-collection SUBSET_TREE/my_bins.txt -C curated \
-p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db \
-c METAGENOMICS_WORKFLOW/03_CONTIGS/"${PROTEIN}"-contigs.db
anvi-split -C curated \
--bin-id EVERYTHING_curated \
-p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db \
-c METAGENOMICS_WORKFLOW/03_CONTIGS/"${PROTEIN}"-contigs.db \
--output-dir SUBSET_TREE/"${PROTEIN}"_curated
Step 4. Add the string β_split_00001β to each tree leaf to import it back into the interface
packages <- c("tidyverse", "ape", "phytools", "glue")
suppressMessages(lapply(packages, library, character.only = TRUE))
add_split_string_to_tree <- function(IN_PATH, OUT_PATH) {
# Import tree
tree <- read.tree(IN_PATH)
# Create DF with tree tip metadata
tree_tip_metadata <- tree$tip.label %>%
as_tibble() %>%
rename(tip_label = value) %>%
mutate(tip_label = str_c(tip_label, "_split_00001"))
tree$tip.label <- tree_tip_metadata$tip_label
# Write DF
print(OUT_PATH)
write.tree(tree, file = OUT_PATH)
}
PROTEIN="" # Replace with name of protein from hmm_list.txt
add_split_string_to_tree(IN_PATH = glue("{PROTEIN}_trimmed_filtered_FastTree.nwk"),
OUT_PATH = glue("{PROTEIN}_trimmed_filtered_FastTree_ed.nwk"))
Step 5. Revisualize the subsetted tree
et voila!
Ecophylo will sanity check all input files that contain contigs-dbs before the workflow starts. This can take a while especially if you are working with 1000βs of genomes. If you want to skip sanity checks for contigs-dbs in your external-genomes and/or metagenomes then adjust your workflow-config to the following:
{
"run_genomes_sanity_check": false
}
The ecophylo workflow by default uses FastTree to calculate the output phylogenetic tree. This is because the workflow was designed to be run on large genomic datasets that could yield thousands of input sequences. However, if you like to run IQ-TREE adjust your workflow-config to the following:
{
"fasttree": {
"run": "",
"threads": 5
},
"iqtree": {
"threads": 5,
"-m": "MFP",
"run": true,
"additional_params": ""
},
}
Error in rule run_metagenomics_workflow:
jobid: 0
input: ECOPHYLO_WORKFLOW/METAGENOMICS_WORKFLOW/metagenomics_config.json
output: ECOPHYLO_WORKFLOW/METAGENOMICS_WORKFLOW/metagenomics_workflow.done
RuleException:
CalledProcessError in file /Users/mschechter/github/anvio/anvio/workflows/ecophylo/rules/profile_mode.smk, line 87:
Command 'set -euo pipefail; cd ECOPHYLO_WORKFLOW/METAGENOMICS_WORKFLOW && anvi-run-workflow -w metagenomics -c metagenomics_config.json --additional-params --rerun-incomplete --latency-wait 100 --keep-going &> 00_LOGS/run_metagenomics_workflow.log && cd -' returned non-zero exit status 1.
File "/Users/mschechter/github/anvio/anvio/workflows/ecophylo/rules/profile_mode.smk", line 87, in __rule_run_metagenomics_workflow
File "/Users/mschechter/miniconda3/envs/anvio-dev/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
One explanation for this error is none of the metagenomes in the samples-txt you provided mapped any reads to extracted target proteins. To test this run the following command and see if you get this error:
$ grep "Nothing to merge" ECOPHYLO_WORKFLOW/METAGENOMICS_WORKFLOW/00_LOGS/run_metagenomics_workflow.log
Command 'set -euo pipefail; echo Nothing to merge for Ribosomal_S11. This should only happen if all profiles were empty (you can check the log file: 00_LOGS/Ribosomal_S11-anvi_merge.log to see if that is indeed the case). This file was created just so that your workflow would continue with no error (snakemake expects to find these output files and if we don't create them, then it will be upset). As we see it, there is no reason to throw an error here, since you mapped your metagenome to some fasta files and you got your answer: whatever you have in your fasta file is not represented in your metagenomes. Feel free to contact us if you think that this is our fault. sincerely, Meren Lab >> 00_LOGS/Ribosomal_S11-anvi_merge.log' returned non-zero exit status 1.
Yes! But sadly, not at the same time, and anviβo feels really bad about that :(
To be clear, you can run one complete workflow, then change the path in the "hmm_list"
parameter in the config file to a different hmm-list then rerun the workflow on the same data in the same directory. For example, this ecophylo directory contains the outputs of Ribosomal_L16 and Ribosomal_S11 over the same data:
$ tree ECOPHYLO_WORKFLOW/ -L 2
ECOPHYLO_WORKFLOW/
βββ 01_REFERENCE_PROTEIN_DATA
βΒ Β βββ E_facealis_MAG
βΒ Β βββ Enterococcus_faecalis_6240
βΒ Β βββ Enterococcus_faecium_6589
βΒ Β βββ S_aureus_MAG
βΒ Β βββ co_assembly
βββ 02_NR_FASTAS
βΒ Β βββ Ribosomal_L16
βΒ Β βββ Ribosomal_S11
βββ 03_MSA
βΒ Β βββ Ribosomal_L16
βΒ Β βββ Ribosomal_S11
βββ 04_SEQUENCE_STATS
βΒ Β βββ Ribosomal_L16
βΒ Β βββ Ribosomal_S11
βββ 05_TREES
βΒ Β βββ Ribosomal_L16
βΒ Β βββ Ribosomal_L16_combined.done
βΒ Β βββ Ribosomal_S11
βΒ Β βββ Ribosomal_S11_combined.done
βββ 06_MISC_DATA
βΒ Β βββ Ribosomal_L16_estimate_scg_taxonomy_results-RAW-LONG-FORMAT.txt
βΒ Β βββ Ribosomal_L16_misc.tsv
βΒ Β βββ Ribosomal_L16_scg_taxonomy_data.tsv
βΒ Β βββ Ribosomal_S11_estimate_scg_taxonomy_results-RAW-LONG-FORMAT.txt
βΒ Β βββ Ribosomal_S11_misc.tsv
βΒ Β βββ Ribosomal_S11_scg_taxonomy_data.tsv
βββ METAGENOMICS_WORKFLOW
βΒ Β βββ 00_LOGS
βΒ Β βββ 03_CONTIGS
βΒ Β βββ 04_MAPPING
βΒ Β βββ 05_ANVIO_PROFILE
βΒ Β βββ 06_MERGED
βΒ Β βββ 07_SUMMARY
βΒ Β βββ Ribosomal_L16_ECOPHYLO_WORKFLOW_state.json
βΒ Β βββ Ribosomal_L16_add_default_collection.done
βΒ Β βββ Ribosomal_L16_state_imported_profile.done
βΒ Β βββ Ribosomal_S11_ECOPHYLO_WORKFLOW_state.json
βΒ Β βββ Ribosomal_S11_add_default_collection.done
βΒ Β βββ Ribosomal_S11_state_imported_profile.done
βΒ Β βββ fasta.txt
βΒ Β βββ metagenomics_config.json
βΒ Β βββ metagenomics_workflow.done
βΒ Β βββ samples.txt
βββ Ribosomal_L16_anvi_estimate_scg_taxonomy_for_SCGs.done
βββ Ribosomal_S11_anvi_estimate_scg_taxonomy_for_SCGs.done
To visualize the results of the different ecophylo runs, just change the paths to include the different proteins:
PROTEIN="" # Replace with name of protein from hmm_list.txt
anvi-interactive -c METAGENOMICS_WORKFLOW/03_CONTIGS/"${PROTEIN}"-contigs.db -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db
However, if you are interested in comparing the outputs of different parameters on the same protein, make a new workflow-config file, otherwise ecophylo will try and fail to overwrite your original run.
To do this, first make your new workflow-config file:
cp config_RP_L16.json config_RP_L16_new_parameters.json
Next, adjust any parameters you want!
Finally, edit the "HOME"
string to a new path to ensure you make a new directory structure, like this:
# edit
"output_dirs": {
"HOME": "ECOPHYLO_WORKFLOW_new_parameters",
"EXTRACTED_RIBO_PROTEINS_DIR": "ECOPHYLO_WORKFLOW/01_REFERENCE_PROTEIN_DATA",
"RIBOSOMAL_PROTEIN_FASTAS": "ECOPHYLO_WORKFLOW/02_NR_FASTAS",
"MSA": "ECOPHYLO_WORKFLOW/03_MSA",
"RIBOSOMAL_PROTEIN_MSA_STATS": "ECOPHYLO_WORKFLOW/04_SEQUENCE_STATS",
"TREES": "ECOPHYLO_WORKFLOW/05_TREES",
"MISC_DATA": "ECOPHYLO_WORKFLOW/06_MISC_DATA",
"SCG_NT_FASTAS": "ECOPHYLO_WORKFLOW/07_SCG_NT_FASTAS",
"RIBOSOMAL_PROTEIN_FASTAS_RENAMED": "ECOPHYLO_WORKFLOW/08_RIBOSOMAL_PROTEIN_FASTAS_RENAMED",
"LOGS_DIR": "00_LOGS"
},
Yes you can add more genomes and metagenomes in your metagenomes, external-genomes, and samples-txt.
BUT, you need to do a couple of steps first so that Snakemake can restart all the processes and maintain as much data as possible:
If you are just adding more metagenomes to your samples-txt, we luckily can preserve the majority of files in the METAGENOMICS_WORKFLOW/
. First, edit your samples-txt files to add more metagenomes, then delete these files below, and finally restart the workflow.
HOME_DIR="ECOPHYLO_WORKFLOW"
PROTEIN="Ribosomal_S11"
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/06_MERGED/
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/07_SUMMARY/
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/"${PROTEIN}"_ECOPHYLO_WORKFLOW_state.json
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/"${PROTEIN}"_add_default_collection.done
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/"${PROTEIN}"_state_imported_profile.done
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/metagenomics_config.json
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/metagenomics_workflow.done
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/samples.txt
If you want to add more assemblies to your metagenomes or external-genomes, youβll sadly need to delete the whole METAGENOMICS_WORKFLOW/
and restart the workflow.
HOME_DIR="ECOPHYLO_WORKFLOW"
PROTEIN="Ribosomal_S11" # Replace with name of protein from hmm_list.txt
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/
Edit this file to update this information.