anvi-script-get-primer-matches [program]

You provide this program with FASTQ files for one or more samples and one or more short sequences, and it collects the reads from the FASTQ files that match your sequences. This tool is most powerful when you want to collect all short reads from one or more metagenomes that are downstream of a known sequence. Using the comprehensive output files, you can analyze the diversity of sequences visually, manually, or with established strategies such as oligotyping.

Authors

A. Murat Eren (Meren)

Can consume

samples-txt

Can provide

short-reads-fasta

Usage

This program finds all reads in a given set of FASTQ files that match user-provided primer sequences.

The primary utility of this program is to recover short reads that may extend into hypervariable regions of genomes. Such regions often suffer from significant drops in coverage in conventional read-recruitment analyses, which prevents any meaningful insight into coverage or variability patterns.

In these situations, one can identify downstream conserved sequences (typically 15 to 25 nucleotides long) using the anvi’o interactive interface or through other means, and then provide those sequences to this program so it can find all matching sequences in a set of FASTQ files without any mapping.
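The idea of finding matching reads without any mapping can be illustrated with a minimal Python sketch. This is not the anvi'o implementation: the function names are hypothetical, the FASTQ input is assumed to be well-formed four-line records, and matching is simplified to an exact, case-insensitive substring search.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs from four-line FASTQ records.

    Assumes well-formed input; an incomplete trailing record is ignored.
    """
    it = iter(lines)
    # zip over the same iterator four times consumes one record per step
    for header, seq, _plus, _qual in zip(it, it, it, it):
        yield header.strip().lstrip("@"), seq.strip()


def find_primer_matches(lines: Iterable[str], primer: str) -> List[Tuple[str, str]]:
    """Return the reads that contain the primer as an exact substring.

    Exact matching is a simplifying assumption made for this sketch.
    """
    primer = primer.upper()
    return [(rid, seq) for rid, seq in parse_fastq(lines) if primer in seq.upper()]
```

Because every read is only scanned for the primer, no reference genome or alignment step is involved, which is why reads that extend into poorly covered regions are still recovered.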

To instead get short reads mapping to a gene, use anvi-get-short-reads-mapping-to-a-gene.

Here is a typical command line to run it:

anvi-script-get-primer-matches --samples-txt samples.txt \
                               --primer-sequences sequences.txt \
                               --output-dir OUTPUT

The samples-txt file lists all the samples of interest, and the primer sequences file lists each primer sequence of interest. Each of these files can contain a single entry or multiple entries.
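As a rough illustration of what the two input files might contain, the Python sketch below generates their text. It rests on two assumptions that should be checked against the anvi'o documentation: that samples-txt is a TAB-delimited file with the columns sample, r1, and r2, and that the primer sequences file holds one primer per line. The helper names and file paths are hypothetical.

```python
from typing import Iterable, Tuple


def make_samples_txt(samples: Iterable[Tuple[str, str, str]]) -> str:
    """Build TAB-delimited samples-txt content (assumed columns: sample, r1, r2)."""
    rows = ["\t".join(("sample", "r1", "r2"))]
    rows += ["\t".join(entry) for entry in samples]
    return "\n".join(rows) + "\n"


def make_primer_file(primers: Iterable[str]) -> str:
    """Build primer file content, assuming one primer sequence per line."""
    return "".join(p.strip().upper() + "\n" for p in primers)


if __name__ == "__main__":
    # Hypothetical sample name and FASTQ paths, for illustration only
    print(make_samples_txt([("sample_01", "s01_R1.fastq", "s01_R2.fastq")]), end="")
    print(make_primer_file(["GAGCAAAGATCATGTTTCAAAA"]), end="")
```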

This will write all of the matching sequences into three FASTA files in the directory OUTPUT. These files differ in their format:

  • Raw sequences: sequences from the FASTQ files that matched a primer, each reported as is with no processing.
  • Trimmed sequences: raw sequences with everything upstream of the primer sequence trimmed, so that all matching sequences start at the same position.
  • Gapped sequences: trimmed sequences padded with gap characters to artificially eliminate length variation.

The last two formats make it possible to generate oligotypes downstream and to cluster short reads from a hypervariable region to estimate their diversity and oligotype proportions.
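The relationship between the raw, trimmed, and gapped formats can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the program's actual code: the function names are hypothetical, the primer is assumed to be kept at the start of each trimmed sequence, and "-" is assumed as the gap character.

```python
from typing import List, Optional, Tuple


def trim_to_primer(seq: str, primer: str) -> Optional[str]:
    """Drop everything upstream of the primer so the result starts at the primer.

    Keeping the primer itself in the output is an assumption of this sketch.
    """
    idx = seq.find(primer)
    return None if idx == -1 else seq[idx:]


def pad_with_gaps(seqs: List[str], gap: str = "-") -> List[str]:
    """Right-pad sequences with a gap character so they all share one length."""
    if not seqs:
        return []
    width = max(len(s) for s in seqs)
    return [s.ljust(width, gap) for s in seqs]


def build_outputs(raw: List[str], primer: str) -> Tuple[List[str], List[str]]:
    """Derive the trimmed and gapped sequence lists from raw matching reads."""
    trimmed = [t for t in (trim_to_primer(s, primer) for s in raw) if t is not None]
    return trimmed, pad_with_gaps(trimmed)


if __name__ == "__main__":
    trimmed, gapped = build_outputs(["TTACGTAAA", "ACGTCC", "GGGACGTA"], "ACGT")
    print(trimmed)  # ['ACGTAAA', 'ACGTCC', 'ACGTA']
    print(gapped)   # ['ACGTAAA', 'ACGTCC-', 'ACGTA--']
```

Equal-length, position-aligned sequences are what make column-wise comparisons such as oligotyping straightforward, which is why the padding step exists even though the gaps carry no biological information.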
