This script report errors in long read assembly using read-recruitment information. The input file should be a BAM file of long reads mapped to an assembly made from these reads..
🔙 To the main page of anvi’o programs and artifacts.
This program does not seem to provide any artifacts. Such programs usually print out some information for you to see or alter some anvi’o artifacts without producing any immediate outputs.
The aim of this script is to find potential assembly errors from long read assemblers.
This script searches for potential assembly errors in a bam-file generated from the mapping of long reads onto an assembly made using the same reads. The basic principle is that (1) every single nucleotide, and (2) consecutive pairs/triplets of nucleotides in an assembly should be covered by at least one of the long reads that was used to generate the assembly.
To find out if every nucleotide is covered by at least one read, one simply needs to find regions with 0x coverage, and that is one output of this script.
To find where two/three consecutive nucleotides are not covered by at least one read, we can leverage the clipping information reported by long read mapping software like minimap2. Clipping happens when the left- or right-most part of a read does not align to the reference. If 100% of reads clip at the same nucleotide position, then it means that not a single read is covering at least two consecutive nucleotides. AND THAT’S NOT GOOD.
Here is an example visualized in IGV. All reads are clipped at the same position. There is no support in the long reads that the left and right sides of the contig should be joined, suggesting that there is a misassembly here.
The only input file to this script is a simple bam-file. But not any BAM file. It has to made by mapping long reads onto an assembly generated using the same reads. You also need to provide an output prefix.
anvi-script-find-misassemblies -b sample01.bam -o result
The first output is a table reporting regions in your assemblies with zero coverage. This output includes the contig in which the region occurs, the contig’s length, the region’s start and stop positions, and the length of the region with no coverage.
contig | length | range | range_size |
---|---|---|---|
contig_001 | 1665603 | 0-498 | 498 |
contig_001 | 1665603 | 100500-101000 | 500 |
contig_001 | 1665603 | 1665106-1665603 | 497 |
The second output is a table reporting the positions with high relative abundance of clipped reads. It includes the contig in which the position occurs, the contig’s length, the position in the contig with high levels of clipping, its relative position in the contig, the total coverage at that position, the coverage of reads clipping at that position, and the ratio between these two coverage values. The ‘relative position’ is the position divided by the contig’s length (a value from 0 to 1), which tells you whether the position is close to the contig ends or somewhere in the middle.
contig | length | pos | relative_pos | cov | clipping | clipping_ratio |
---|---|---|---|---|---|---|
contig_001 | 1665603 | 498 | 0.0002989908159387321 | 48 | 48 | 1.0 |
contig_001 | 1665603 | 500999 | 0.30079136504917436 | 120 | 120 | 1.0 |
contig_001 | 1665603 | 501000 | 0.30079196543233894 | 79 | 79 | 1.0 |
contig_001 | 1665603 | 1665105 | 0.9997010091840612 | 45 | 45 | 1.0 |
By default, the script will report clipping positions if the ratio of clipping reads to the total number of reads is over 80%. You can change that threshold with the flag --clipping-ratio
:
anvi-script-find-misassemblies -b sample01.bam -o result --clipping-ratio 0.6
Another default behavior of the script is to skip the first and last 100bp of a contig (only valid for the table reporting positions of high clipping). This is because contig ends often have high proportions of clipping which have nothing to do with misassemblies. You can change that parameter with the flag --min-dist-to-end
:
anvi-script-find-misassemblies -b sample01.bam -o result --min-dist-to-end 0
Edit this file to update this information.
Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__
tag in this file to see an example.