The flagship anvi'o program to profile a BAM file. Running this program on a BAM file will quantify coverages per nucleotide position in read recruitment results and will average coverage and detection data per contig. It will also calculate single-nucleotide, single-codon, and single-amino acid variants, as well as structural variants such as insertions and deletions, and will eventually store all data in a single anvi'o profile database. For very large projects, this program can demand a lot of time, memory, and storage resources. If all you want is to learn the coverages of your nucleotides, genes, contigs, or bin collections from BAM files very rapidly, and/or you do not need anvi'o single profile databases for your project, please see other anvi'o programs that profile BAM files.
Once you have a single-profile-db, you can run programs like anvi-cluster-contigs, anvi-estimate-metabolism, and anvi-gen-gene-level-stats-databases, as well as use the interactive interface with anvi-interactive. If you want to run these same contigs against multiple BAM files (because you have multiple samples), you’ll combine your single-profile-dbs into a profile-db after you’ve created them all using anvi-merge. See the pages for single-profile-db or profile-db for more you can do with these artifacts.
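In short, this program runs various analyses on the contigs in your contigs-db and how they relate to the sample information stored in the bam-file you provided. It then stores this information into a single-profile-db. Specifically, this program calculates coverage per nucleotide position; single-nucleotide, single-codon, and single-amino acid variants; and structural variants such as insertions and deletions.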
This program takes in an indexed bam-file and a contigs-db. The BAM file contains the short reads from a single sample that will be used to create the profile database. Thus, here is a standard run with default parameters:
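A minimal sketch of such a run, assuming hypothetical file names (`bam_file.bam`, `contigs.db`) and assuming the BAM index sits next to the BAM file as `bam_file.bam.bai`. The guard launches anvi-profile only when the program and all inputs are present; otherwise it prints the command it would have run.

```shell
#!/bin/sh
# Hypothetical input names; replace them with your own files.
bam="bam_file.bam"
contigs_db="contigs.db"

# Standard run with default parameters: -i is the BAM file, -c the contigs database.
# (The command is kept in a plain string, so file names must not contain spaces.)
cmd="anvi-profile -i $bam -c $contigs_db"

# Assumption: the BAM index lives next to the BAM file as <name>.bam.bai.
if command -v anvi-profile >/dev/null 2>&1 \
   && [ -f "$bam" ] && [ -f "$bam.bai" ] && [ -f "$contigs_db" ]; then
    $cmd
else
    echo "Not running (program or inputs missing). The command would be:"
    echo "  $cmd"
fi
```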
Alternatively, if you lack mapping data, you can add the flag --blank-profile so that you can still get the functionality of a profile database.
anvi-profile -c contigs-db \
             --blank-profile
If you want to first check your BAM file to see what contigs it contains, just use the flag --list-contigs to see a comprehensive list.
Note: This describes how to profile a named subset of contigs. To profile a subset of contigs based on their characteristics (for example, only contigs of a certain length or that have a certain coverage), see the section below on “contig specifications”.
By default, anvi’o will use every contig in your contigs-db. However, if you wish to focus specifically on a subset of these contigs, just provide a file that contains only the names of the contigs you want to analyze, one per line, using the tag --contigs-of-interest.
For example, you could run:
anvi-profile -c Ross_sea_contigs.db \
             --blank-profile \
             --contigs-of-interest contigs_i_like.txt
contigs_i_like.txt looks like this:
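As an illustration only, the snippet below writes such a file. The contig names are hypothetical placeholders; use names that actually exist in your contigs-db.

```shell
#!/bin/sh
# Hypothetical contig names: one name per line, no header.
cat > contigs_i_like.txt <<'EOF'
contig_000001
contig_000002
contig_000015
EOF

# Print the number of contigs listed in the file.
wc -l < contigs_i_like.txt
```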
Changing the parameters described below will affect the way that your sequences are analyzed.
Keep in mind that if you plan to merge your resulting single-profile-db with others later in the project, you’ll want to keep these parameters consistent.
To profile only contigs within a specific length range, you can use the flags --min-contig-length and --max-contig-length. By default, the minimum length for analysis is 1000 nts and there is no maximum length. You can also profile only contigs that have a certain average coverage with the flag --min-mean-coverage.
You can also ignore reads in your BAM file with a percent identity to the reference less than some threshold using the flag --min-percent-identity. By default, all reads are used.
For example, the following code will only look at contigs longer than 2000 nts and will ignore BAM file reads with less than 95 percent identity to the reference:
anvi-profile -c Ross_sea_contigs.db \
             -i bam_file.bam \
             --min-contig-length 2000 \
             --min-percent-identity 95
By default, anvi’o fetches all reads from the BAM file. With --fetch-filter you can determine which reads from a BAM file will be used for profiling. The current filters are:
double-forwards: only paired-end reads with both R1 and R2 with a ‘forward’ orientation,
double-reverses: only paired-end reads with both R1 and R2 with a ‘reverse’ orientation,
inversions: only paired-end reads with both R1 and R2 either ‘forward’ or ‘reverse’ and a maximum insert size of 2000 nts,
single-mapped-reads: only single mapped reads (mate is unmapped),
distant-pairs-1K: only paired-end reads with a minimum 1000 nts insert size.
For example, the following code only considers ‘inversions’ reads:
anvi-profile -c Ross_sea_contigs.db \
             -i bam_file.bam \
             --fetch-filter inversions
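If you want to try several filters on the same BAM file, one option is to profile it once per filter. The loop below is a dry-run sketch that only prints the commands. The per-filter sample names passed to -S are a hypothetical naming scheme, and the -o output directory flag is an assumption that you should confirm in the program help menu.

```shell
#!/bin/sh
bam="bam_file.bam"
contigs_db="Ross_sea_contigs.db"

# The filter names listed above.
filters="double-forwards double-reverses inversions single-mapped-reads distant-pairs-1K"

count=0
for f in $filters; do
    # Hypothetical naming scheme: turn '-' into '_' to build a sample name.
    sample="sample_$(echo "$f" | sed 's/-/_/g')"
    # Assumption: -o names a separate output directory for each profile.
    echo "anvi-profile -c $contigs_db -i $bam --fetch-filter $f -S $sample -o $sample"
    count=$((count + 1))
done
echo "$count commands generated"
```

Printing the commands first makes it easy to review them before spending runtime on five separate profiles.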
By default, anvi’o will not try to cluster your splits (since it takes quite a bit of runtime) unless you are using the tag --blank-profile. If you don’t want to run this, use the tag --skip-hierarchical-clustering.
If you’re planning to later merge this sample with others, it is better to perform clustering while running anvi-merge than at this stage.
However, if you want to bin this single sample or otherwise want clustering to happen, just use the tag --cluster-contigs.
If you do plan to cluster, you can set a custom distance metric or a custom linkage method.
anvi-profile will throw away variability data below certain thresholds to reduce noise. After all, if you have a single C read at a position with 1000X coverage where all other reads are T, this is probably not a variant position that you want to investigate further. By default, it will not analyze positions with coverage less than 10X, and it will further discard variants that fall below its noise thresholds.
However, you can change the coverage threshold using the --min-coverage-for-variability flag. You can also report every variability position using the flag --report-variability-full.
For example, if you wanted to view every variant, you would profile with the following:
anvi-profile -c Ross_sea_contigs.db \
             -i bam_file.bam \
             --min-coverage-for-variability 1 \
             --report-variability-full
You should provide the sample name with the flag -S and can provide a description of your project using the --description tag followed by a text file. These will help anvi’o name output files and will show up in the anvi’o interfaces down the line.
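A small sketch of preparing both inputs. The sample name and the description text are hypothetical placeholders, and the command is only printed so you can review it before running it.

```shell
#!/bin/sh
# Hypothetical project description stored in a plain text file.
cat > description.txt <<'EOF'
Hypothetical description of the project and of this sample.
EOF

# Hypothetical sample name.
sample_name="Ross_sea_sample_01"

# -S takes the sample name; --description takes the path to the text file.
cmd="anvi-profile -c Ross_sea_contigs.db -i bam_file.bam -S $sample_name --description description.txt"
echo "$cmd"
```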
You can characterize the codon frequencies of genes in your sample at the cost of some runtime. Despite time being money, codon frequency analysis can be helpful downstream. Simply add the tag --profile-SCVs and watch the magic happen.
If you have prior experience with --profile-SCVs being slow, you will be surprised how fast it is now.
Alternatively, you can choose not to store insertion and deletion data or single nucleotide variant data.
If you know the limits of your system, you can also multithread this program. See the program help menu for more information.
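As a hedged sketch of choosing a thread count, the snippet below asks the system how many cores are online and leaves one free. The -T flag is an assumption here (anvi'o programs commonly expose it as -T / --num-threads), so confirm it in the help menu before relying on it.

```shell
#!/bin/sh
# Number of online cores; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Leave one core free for the rest of the system, but never go below 1.
threads=$((cores - 1))
if [ "$threads" -lt 1 ]; then
    threads=1
fi

# Assumption: -T sets the number of threads; verify with `anvi-profile --help`.
echo "anvi-profile -c Ross_sea_contigs.db -i bam_file.bam -T $threads"
```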