anvi-integrate-trnaseq

Integrate tRNA-seq with (meta)genomic data, relating tRNA-seq seeds to tRNA genes.


Authors

Can consume

trnaseq-contigs-db seeds-specific-txt modifications-txt contigs-db

Can provide

trna-gene-hits

Usage

This program integrates tRNA-seq seeds with tRNA genes in (meta)genomic contigs. It produces trna-gene-hits in a trnaseq-contigs-db.

Seeds, or predicted tRNA transcripts, are the end product of the anvi’o trnaseq workflow. Seeds are generated by the program anvi-merge-trnaseq, which merges the tRNA sequence information, including modifications, predicted by anvi-trnaseq from the individual tRNA-seq samples in an experiment. anvi-merge-trnaseq produces a trnaseq-contigs-db and trnaseq-profile-dbs. Seed and (specific) coverage information from these databases is summarized by anvi-tabulate-trnaseq in a concise, easily parsable table, seeds-specific-txt (see the trnaseq-profile-db artifact for an explanation of “specific” and “nonspecific” coverage). Modification information from the databases is summarized in a modifications-txt table. The trnaseq-contigs-db is a mandatory input, as it is modified by anvi-integrate-trnaseq. The seeds-specific-txt and modifications-txt tables are also mandatory inputs, used in the computational steps of the program.

The final mandatory input is a (meta)genomic contigs-db annotated with tRNA gene calls by anvi-scan-trnas. The program relates seeds (predicted transcripts) to the tRNA genes responsible for their expression. Multiple copies of the same tRNA gene are often present in a genome, so read recruitment to any one of these copies cannot be resolved. In a metagenome, the same tRNA gene may be found in multiple bins or in both binned and unbinned contigs. Though this can be due to binning errors, identical tRNA genes are often found across taxa as a consequence of their short length and, in many cases, broad phylogenetic distribution. Currently, when the user instructs anvi-integrate-trnaseq to be conscious of a collection of bins, the program ignores seeds matched to genes that are not restricted to a single bin.
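The bin-aware filtering described above can be illustrated with a minimal sketch. This is not the program's implementation: the function name, the dictionary inputs, and the choice to treat unbinned genes as disqualifying are all assumptions made for illustration.

```python
def seeds_restricted_to_one_bin(seed_gene_hits, gene_bins):
    """Keep only seeds whose matching genes all fall in one bin.

    seed_gene_hits: hypothetical dict mapping seed name -> list of gene ids.
    gene_bins: hypothetical dict mapping gene id -> bin name; genes on
        unbinned contigs are simply absent from this dict.
    Returns a dict mapping each retained seed to its single bin.
    """
    kept = {}
    for seed, genes in seed_gene_hits.items():
        # None stands for an unbinned gene (assumption: this disqualifies the seed).
        bins = {gene_bins.get(gene) for gene in genes}
        if len(bins) == 1 and None not in bins:
            kept[seed] = next(iter(bins))
    return kept
```

For example, a seed matching identical genes in two different bins, or in one bin and an unbinned contig, would be dropped by this sketch, while a seed matching two gene copies in the same bin would be kept.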

Gene search process

Seed sequence permutation

The first step in this program is permutation of seed sequences at sites of predicted modifications, to generate a set of permuted sequences per seed. Permuted sequences, including the original seed sequence, are BLASTed against the tRNA genes from the contigs database. Nucleotide permutations account for the fact that the majority nucleotide at a position – which is used in the seed sequence – need not be the unmodified nucleotide. Modification-induced substitutions in tRNA-seq reads result in a semi-random set of nucleotides at modified positions, with mutation fraction proportional to modification fraction. The permuted sequence with the strongest alignment to one or more genes is selected and the others discarded. If multiple permuted sequences have equally strong hits, the one with fewer permutations is favored.
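The selection rule at the end of this step can be sketched as follows. The sketch assumes that alignment strength is measured by mismatch count (fewer is stronger) and that each candidate is a hypothetical tuple of sequence, number of permuted positions, and mismatches in its best gene hit; the program itself may score hits differently.

```python
def select_permuted_sequence(candidates):
    """Pick the candidate with the strongest hit, favoring fewer permutations.

    candidates: list of (sequence, n_permuted_positions, n_mismatches) tuples,
        where n_mismatches is taken from the candidate's best gene alignment.
        This tuple layout is an assumption for illustration.
    """
    if not candidates:
        return None
    # Sort key: mismatches first (alignment strength), then permuted positions (tie-break).
    return min(candidates, key=lambda c: (c[2], c[1]))
```

Because the unpermuted seed has zero permuted positions, it wins any tie against a permuted sequence with an equally strong hit.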

There are two parameters related to permutation.

--min-nt-frequency is the minimum relative frequency, summed across all tRNA-seq samples, required for a nucleotide to be substituted in a permuted sequence. For example, the default value of 0.05 means that a nucleotide with a frequency of 0.04 at a predicted modified position in a seed will not be used in permuted sequences of the seed.

--max-variable-positions is the maximum number of nucleotides that can be permuted at once in a seed sequence. For example, the default value of 5 means that a seed with 6 predicted modification positions will generate a set of permuted sequences with 1, 2, 3, 4, or 5 of these positions containing different nucleotides than the seed (subject to the constraint of --min-nt-frequency, which may preclude permutation of certain positions with low-level mutations).
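How the two parameters interact can be shown with a small sketch. It is an illustration under stated assumptions, not the program's code: the function name and the `site_freqs` input (nucleotide frequencies per predicted modification position, already aggregated across samples) are hypothetical.

```python
from itertools import combinations, product


def permute_seed(seed, site_freqs, min_nt_frequency=0.05, max_variable_positions=5):
    """Return (sequence, n_permuted_positions) pairs, starting with the seed itself.

    site_freqs: hypothetical dict mapping a 0-based modified position to
        {nucleotide: relative frequency across all samples}.
    """
    # Per site, keep nucleotides that differ from the seed and pass the frequency cutoff.
    alternatives = {}
    for pos, freqs in site_freqs.items():
        alts = [nt for nt, freq in sorted(freqs.items())
                if nt != seed[pos] and freq >= min_nt_frequency]
        if alts:
            alternatives[pos] = alts

    permuted = [(seed, 0)]
    positions = sorted(alternatives)
    # Permute 1, 2, ... positions at once, capped by max_variable_positions.
    for k in range(1, min(max_variable_positions, len(positions)) + 1):
        for chosen in combinations(positions, k):
            for nts in product(*(alternatives[p] for p in chosen)):
                seq = list(seed)
                for p, nt in zip(chosen, nts):
                    seq[p] = nt
                permuted.append(("".join(seq), k))
    return permuted
```

The number of permuted sequences grows combinatorially with the number of eligible positions, which is presumably why a cap such as --max-variable-positions is useful.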

Alignment

The parameter --max-mismatches sets the maximum number of mismatches allowed in the alignment of a (permuted) seed to a tRNA gene. The default value is 3.

Selected seed sequences (permuted or not) may differ from aligned gene sequences for two principal reasons. First, the seed can have undetected modification-induced nucleotide substitutions (or indels, which are ignored here). For example, A34 in many tRNAs can be fully modified to inosine, which is detected as a G rather than A in tRNA-seq reads. Second, the seed can have a single nucleotide variant that differs from the contig sequence. It is not uncommon for populations to have tRNAs with SNVs. When a seed matches multiple genes, it may be possible to distinguish latent modifications from SNVs: if the different genes contain different nucleotides at the position in question, the seed mismatch is likely due to a SNV rather than a modification.
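The reasoning for separating SNVs from latent modifications can be sketched in a few lines. The function name, the input format (gene subsequences already aligned to the seed and of equal length), and the output labels are assumptions for illustration only.

```python
def classify_mismatches(seed, gene_segments):
    """Label each position where the seed disagrees with at least one gene.

    gene_segments: hypothetical list of gene subsequences, each the same
        length as the seed and aligned to it without gaps.
    Returns a dict mapping 0-based position -> label.
    """
    calls = {}
    for pos, seed_nt in enumerate(seed):
        gene_nts = {segment[pos] for segment in gene_segments}
        if gene_nts == {seed_nt}:
            continue  # every gene agrees with the seed here
        if len(gene_nts) > 1:
            # The genes themselves differ at this position, pointing to a SNV.
            calls[pos] = "likely SNV"
        else:
            # All genes share one nucleotide that differs from the seed.
            calls[pos] = "modification or SNV (unresolved)"
    return calls
```

With a single matching gene, a mismatch always falls in the unresolved category, since there is no second gene to compare against.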

Alignments are ungapped and must cover the full length of the seed query, except for the post-transcriptional G added to the 5’ end of tRNA-His. Since tRNA-seq reads start from the 3’ end of tRNA and often stop before the 5’ end, seeds need not cover the whole tRNA gene (unless specified in the trnaseq workflow). Seeds must align with the 3’ end of the gene, or with the 3’ end of the gene less the 3 nucleotides CCA. The 3’-CCA acceptor sequence is removed from the seed sequence because it is often, but not always, added post-transcriptionally.
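The alignment constraints above can be summarized in a minimal sketch. The program itself uses BLAST; the direct string comparison here is a stand-in chosen for clarity, and the function names and the `is_his` flag are hypothetical.

```python
def count_mismatches(query, target):
    """Count differing positions between two equal-length sequences."""
    return sum(a != b for a, b in zip(query, target))


def align_seed_to_gene(seed, gene, max_mismatches=3, is_his=False):
    """Return the mismatch count of the best ungapped, 3'-anchored alignment, or None.

    The seed is assumed to have had its 3'-CCA removed already.
    """
    # The gene may or may not encode the 3'-CCA, so try both 3' ends.
    targets = [gene]
    if gene.endswith("CCA"):
        targets.append(gene[:-3])
    # For tRNA-His, the extra 5' G of the seed may be left out of the alignment.
    queries = [seed]
    if is_his and seed.startswith("G"):
        queries.append(seed[1:])

    best = None
    for query in queries:
        for target in targets:
            if not query or len(query) > len(target):
                continue  # the seed must fit entirely within the gene
            # Anchor at the 3' end: compare against the last len(query) nucleotides.
            mismatches = count_mismatches(query, target[-len(query):])
            if mismatches <= max_mismatches and (best is None or mismatches < best):
                best = mismatches
    return best
```

Anchoring at the 3’ end means a seed shorter than the gene is compared only with the gene's 3’ portion, which mirrors the fact that seeds need not reach the 5’ end.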

