anvi-script-process-genbank-metadata [program]

This script takes the 'metadata' output of the program ncbi-genome-download (see https://github.com/kblin/ncbi-genome-download for details) and processes each GenBank file listed in that metadata file to generate a FASTA file, as well as genes and functions files, for each entry. It also automatically generates a fasta-txt file descriptor for anvi'o snakemake workflows. So it is a multi-talented program like that.


Authors

A. Murat Eren (Meren)
Daniel Blankenberg

Can consume

This program seems to know what it's doing. It needs no input material from its user. Good program.

Can provide

contigs-fasta functions-txt external-gene-calls

Usage

Suppose you have downloaded some genomes from NCBI (using ncbi-genome-download) and you have a metadata table describing those genomes. This program will convert that metadata table into some useful files: for each genome you downloaded, a FASTA file of contig sequences, an external gene calls file, and an external functions file; and, for all downloaded genomes together, a single tab-delimited fasta-txt file describing the path to each of these files (which you can pass directly to a snakemake workflow if you need to). Yay.

The metadata file

The prerequisite for running this program is to have a tab-delimited metadata file containing information about each of the genomes you downloaded from NCBI. Let’s say your download command started like this: ncbi-genome-download --metadata-table ncbi_metadata.txt -t .... So for the purposes of this usage tutorial, your metadata file is called ncbi_metadata.txt.

In case you are wondering, that file should have a header that looks something like this:

assembly_accession	bioproject	biosample	wgs_master	excluded_from_refseq	refseq_category	relation_to_type_material	taxid	species_taxid	organism_name	infraspecific_name	isolate	version_status	assembly_level	release_type	genome_rep	seq_rel_date	asm_name	submitter	gbrs_paired_asm	paired_asm_comp	ftp_path	local_filename
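
If you want to confirm that a metadata file has a header like this before handing it to the program, a quick check along the following lines may help. This is a minimal Python sketch and not part of anvi'o: the function name missing_columns and the short list of columns it checks are assumptions made for illustration, and the program itself may rely on a different set of columns.

```python
import csv
import io

# A hypothetical subset of the header shown above. Which columns the program
# actually reads is not stated here, so this list is only an illustration.
CHECKED_COLUMNS = ["assembly_accession", "organism_name", "ftp_path", "local_filename"]


def missing_columns(handle, expected=CHECKED_COLUMNS):
    """Return the expected column names that are absent from the header line
    of a tab-delimited file (an empty file is treated as having no columns)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    present = set(header)
    return [name for name in expected if name not in present]


# In-memory demonstration with a made-up, shortened header line.
demo = io.StringIO("assembly_accession\tbioproject\tftp_path\n")
print(missing_columns(demo))
```

For a real file, the same function can be given an open handle, for example open("ncbi_metadata.txt", newline=""). An empty result means every checked column was found.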

Basic usage

If you run the program with only the metadata file, all the output files will show up in your current working directory.

anvi-script-process-genbank-metadata -m ncbi_metadata.txt

Choosing an output directory

Alternatively, you can specify a directory in which to generate the output:

anvi-script-process-genbank-metadata -m ncbi_metadata.txt -o DOWNLOADED_GENOMES

Picking a name for the fasta-txt file

The default name for the fasta-txt file is fasta-input.txt, but you can change that with the --output-fasta-txt parameter.

anvi-script-process-genbank-metadata -m ncbi_metadata.txt --output-fasta-txt ncbi_fasta.txt
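
If you launch the program from a script, the options shown so far can be assembled programmatically. The following is a minimal Python sketch under a few assumptions: build_command is a hypothetical helper, it only builds the argument list and does not run anything, and it assumes that -o and --output-fasta-txt can be combined in a single run, which is not stated explicitly here.

```python
import shlex


def build_command(metadata, output_dir=None, fasta_txt=None):
    """Assemble the argument list for one run. Nothing is executed here.

    Only the flags described in this document are used: -m, -o and
    --output-fasta-txt. Combining -o with --output-fasta-txt is an assumption.
    """
    command = ["anvi-script-process-genbank-metadata", "-m", metadata]
    if output_dir is not None:
        command += ["-o", output_dir]
    if fasta_txt is not None:
        command += ["--output-fasta-txt", fasta_txt]
    return command


# Print the command line as it would be typed in a shell.
print(shlex.join(build_command("ncbi_metadata.txt",
                               output_dir="DOWNLOADED_GENOMES",
                               fasta_txt="ncbi_fasta.txt")))
```

The returned list can be passed to subprocess.run(command, check=True) on a machine where anvi'o is installed.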

Make a fasta-txt without the gene calls and functions columns

The default columns in the fasta-txt file are:

name	path	external_gene_calls	gene_functional_annotation

But sometimes, you don’t want your downstream snakemake workflow to use those external gene calls or functional annotations files. So to skip adding those columns into the fasta-txt file, you can use the -E flag:

anvi-script-process-genbank-metadata -m ncbi_metadata.txt --output-fasta-txt ncbi_fasta.txt -E

Then the fasta-txt will only contain a name column and a path column.
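
To see which of the two forms a given fasta-txt has, it can be enough to read its header line. The following is a minimal Python sketch, assuming the fasta-txt is tab-delimited with a header row as described above. The function name read_fasta_txt, the genome name, and all file paths in the demonstration are made up for illustration; the actual file names the program writes are not stated here.

```python
import csv
import io


def read_fasta_txt(handle):
    """Return (column_names, rows) for a tab-delimited fasta-txt file,
    where each row is a dict keyed by column name."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return list(reader.fieldnames or []), rows


# Hypothetical contents: one with all four default columns, one as
# described for a run with the -E flag.
full = (
    "name\tpath\texternal_gene_calls\tgene_functional_annotation\n"
    "genome_A\tgenome_A.fa\tgenome_A-calls.txt\tgenome_A-functions.txt\n"
)
minimal = "name\tpath\ngenome_A\tgenome_A.fa\n"

for label, text in (("default", full), ("with -E", minimal)):
    columns, rows = read_fasta_txt(io.StringIO(text))
    has_gene_calls = "external_gene_calls" in columns
    print(label, columns, "gene calls column present:", has_gene_calls)
```

A downstream script could use the same check to decide whether to expect external gene calls and functions for each genome.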

