anvi-split

Split an anvi'o pan or profile database into smaller, self-contained projects. Black magic.

Authors

Requires

profile-db collection

Can use

contigs-db genomes-storage-db pan-db contig-classification collection-txt

Provides

split-bins

Usage

Creates individual, self-contained anvi’o projects for one or more bins stored in an anvi’o collection. This program may be useful if you would like to share a subset of an anvi’o project with the community or a collaborator, or focus on a particular aspect of your data without having to initialize very large files. Altogether, anvi-split promotes reproducibility, openness, and collaboration.

The program can generate split-bins from metagenomes, from pangenomes, or from a contigs-db on its own (without a profile-db).

Each of the resulting directories in your output folder will contain a stand-alone anvi’o project that can be used or shared without requiring access to any files of the original (larger) dataset.

Splitting metagenomes and pangenomes

To split bins from a metagenome, you can provide the program anvi-split with a contigs-db and profile-db pair. To split gene clusters from a pangenome, you can provide it with a genomes-storage-db and pan-db pair. In both cases you will also need a collection. If you don’t provide any bin names, the program will create individual directories for each bin that is found in your collection. You can also limit the output to a single bin.

An example run

Assume you have a profile-db that has a collection with three bins, which are (very creatively) called BIN_1, BIN_2, and BIN_3.

If you ran the following code:

anvi-split -p profile-db \
           -c contigs-db \
           -C collection \
           -o OUTPUT

You would get 3 new pairs of profile-db and contigs-db files, one for each bin, located in OUTPUT/BIN_1/, OUTPUT/BIN_2/, and OUTPUT/BIN_3/.

Alternatively, you can specify a bin name to limit the reported bins:

anvi-split -p profile-db \
           -c contigs-db \
           -C collection \
           --bin-id BIN_1 \
           -o OUTPUT

Similarly, if you provide a genomes-storage-db and pan-db pair, the directories will contain their own smaller genomes-storage-db and pan-db pairs.

You can always use the program anvi-show-collections-and-bins to learn available collection and bin names in a given profile-db or pan-db.
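Since --bin-id limits a run to a single bin, as described above, splitting a chosen subset of bins (say, two of the three) can be done with one run per bin. The following is a minimal dry-run sketch rather than a feature of anvi-split: the helper name print_split_commands is hypothetical, profile-db and contigs-db stand in for real file paths as in the examples above, and giving each run its own output directory is an assumption made here so that the runs cannot collide. The sketch only prints the commands so they can be reviewed before running.

```shell
# Dry run: print one anvi-split command per requested bin.
# `print_split_commands` is a hypothetical helper, not an anvi'o program.
print_split_commands() {
    collection="$1"
    shift
    for bin in "$@"; do
        # A separate output directory per bin is an assumption of this sketch.
        printf 'anvi-split -p profile-db -c contigs-db -C %s --bin-id %s -o OUTPUT_%s\n' \
            "$collection" "$bin" "$bin"
    done
}

# Print the commands for two of the three example bins.
commands=$(print_split_commands collection BIN_1 BIN_3)
printf '%s\n' "$commands"
```

Once the placeholder names are replaced with real paths and a real collection name, the printed lines can be pasted into a terminal or piped to sh.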

Performance

For extremely large datasets, splitting bins may be difficult. For metagenomics projects, you can:

  • Use the flag --skip-variability-tables to NOT report single-nucleotide variants or single-amino acid variants in your split bins (which can reach hundreds of millions of lines of information for large and complex metagenomes), and/or,
  • Use the flag --compress-auxiliary-data to save space. While this is a great option for data that is meant to be stored long-term and shared with the community, the compressed file would need to be manually decompressed by the end-user prior to using the split bin.

Splitting a contigs database without a profile database

anvi-split can split a contigs-db on its own, without any profile-db. Each resulting directory will contain a self-contained contigs-db for that group of contigs. Two input modes are available. You will need either a collection-txt file mapping contigs to bins, or per-contig domain-level classification data previously imported with anvi-import-contig-classification.

Using an external collection file

You can provide a two-column, TAB-delimited file with no header, where column 1 is the contig name and column 2 is the bin name:

anvi-split -c contigs-db \
           --collection-txt collection-txt \
           -o OUTPUT
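As a rough illustration of the file layout described above, the sketch below writes a small collection file and checks that every line has exactly two non-empty, TAB-separated fields. The contig names, bin names, and the file name my-collection.txt are invented for the example, and the awk check is a convenience of this sketch, not something anvi-split is known to require.

```shell
# Write a two-column, TAB-delimited collection file with no header.
# Contig and bin names below are invented for illustration.
collection_file="my-collection.txt"
printf '%s\t%s\n' \
    contig_0001 BIN_1 \
    contig_0002 BIN_1 \
    contig_0003 BIN_2 > "$collection_file"

# Count lines that do not have exactly two non-empty TAB-separated fields.
bad_lines=$(awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { n++ } END { print n + 0 }' "$collection_file")
echo "malformed lines: $bad_lines"

# Summarize how many contigs each bin would receive.
bin_counts=$(awk -F'\t' '{ count[$2]++ } END { for (b in count) print b "\t" count[b] }' "$collection_file" | sort)
printf '%s\n' "$bin_counts"
```

A file that passes this check could then be given to the program as --collection-txt my-collection.txt.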

Using contig classification data

If your contigs-db has classification data imported with anvi-import-contig-classification, you can split it by contig class:

anvi-split -c contigs-db \
           --split-by-contig-classification \
           -o OUTPUT

Each class (e.g., virus, plasmid, non-eukaryotic) will become a separate output database. You can limit the output to specific classes with --classes-to-keep:

anvi-split -c contigs-db \
           --split-by-contig-classification \
           --classes-to-keep virus,plasmid \
           -o OUTPUT

Handling classification conflicts

If your contigs-db has contig-classification data from multiple sources, the same contig may be assigned different classes by different sources. anvi-split will raise an error when it encounters such conflicts. For example, the following classification table has data from two sources, whokaryote and alien. Both agree on contig1 through contig3, but disagree on contig4 through contig6 — whokaryote assigns them class 1 (eukaryotic) while alien assigns them class 2 (virus):

contig   class  source      tool_classification  confidence
contig1  1      whokaryote  eukaryote            NA
contig2  1      whokaryote  eukaryote            NA
contig3  1      whokaryote  eukaryote            NA
contig4  1      whokaryote  eukaryote            NA
contig5  1      whokaryote  eukaryote            NA
contig6  1      whokaryote  eukaryote            NA
contig1  1      alien       eukaryote            NA
contig2  1      alien       eukaryote            NA
contig3  1      alien       eukaryote            NA
contig4  2      alien       virus                NA
contig5  2      alien       virus                NA
contig6  2      alien       virus                NA

anvi-split will refuse to proceed until you decide how to handle these conflicts. You have three options:

  • --only-use-classification-source SOURCE: only use classifications from one source, ignoring the other sources entirely.
  • --allow-multiple-classifications: allow conflicting contigs to appear in all output splits they were assigned to.
  • --mark-conflicting-contigs-as-ambiguous: redirect conflicting contigs into a separate ambiguous split and write a report file documenting their original classifications.
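To see ahead of time which contigs would trigger the conflict error, a classification table like the example above can be scanned with awk. This is a minimal sketch, assuming the table is TAB-delimited with a header line and the column order shown in the example (contig, class, source, and so on). The file name classification.txt is hypothetical, and the logic only mimics the idea of a conflict as described here; it is not anvi-split's actual implementation.

```shell
# Recreate the example table (TAB-delimited, with a header line).
table="classification.txt"
{
    printf 'contig\tclass\tsource\ttool_classification\tconfidence\n'
    for i in 1 2 3 4 5 6; do
        printf 'contig%s\t1\twhokaryote\teukaryote\tNA\n' "$i"
    done
    for i in 1 2 3; do
        printf 'contig%s\t1\talien\teukaryote\tNA\n' "$i"
    done
    for i in 4 5 6; do
        printf 'contig%s\t2\talien\tvirus\tNA\n' "$i"
    done
} > "$table"

# A contig is in conflict when it appears with more than one distinct class.
conflicts=$(awk -F'\t' '
    NR == 1 { next }                                  # skip the header
    ($1 in cls) && cls[$1] != $2 { conflict[$1] = 1 } # class differs from an earlier row
    { cls[$1] = $2 }
    END { for (c in conflict) print c }
' "$table" | sort)
printf 'conflicting contigs:\n%s\n' "$conflicts"

# Keeping rows from one source only (the idea behind
# --only-use-classification-source) leaves no conflicting rows here.
single_source=$(awk -F'\t' '
    NR == 1 || $3 != "whokaryote" { next }
    ($1 in cls) && cls[$1] != $2 { n++ }
    { cls[$1] = $2 }
    END { print n + 0 }
' "$table")
echo "conflicting rows using whokaryote only: $single_source"
```

For the example table this lists contig4, contig5, and contig6, the same contigs on which whokaryote and alien disagree.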
