anvi-get-sequences-for-gene-calls

A script to get back sequences for gene calls.

🔙 To the main page of anvi’o programs and artifacts.

Authors

Can consume

contigs-db genomes-storage-db

Can provide

genes-fasta external-gene-calls

Usage

This program allows you to export the sequences of your gene calls from a contigs-db or genomes-storage-db in the form of a genes-fasta.

If you want other information about your gene calls from a contigs-db, you can run anvi-export-gene-calls (which outputs a gene-calls-txt) or get the coverage and detection information with anvi-export-gene-coverage-and-detection.

Running on a contigs database

You can run this program on a contigs-db like so:

anvi-get-sequences-for-gene-calls -c contigs-db \ -o path/to/output

This is create a genes-fasta that contains every gene in your contigs database. If you only want a specific subset of genes, you can run the following:

anvi-get-sequences-for-gene-calls -c contigs-db \ -o path/to/output \ --gene-caller-ids 897,898,1312 \ --delimiter ,

Now the resulting genes-fasta will contain only those three genes.

Please note that this program allows you to format the deflines of the resulting FASTA file to a great extent. For this, it uses a set of previously-defined variables you can use to define a template of your liking. You can learn about the available variables, you can include the following flag in your command:

anvi-get-sequences-for-gene-calls -c contigs-db \ -o path/to/output \ --list-defline-variables

which will give you an output similar to this:

WARNING
===============================================
Here are the variables you can use to provide a user-defined defline template:

* {gene_caller_id}
* {contig_name}
* {start}
* {stop}
* {direction}
* {length}
* {contigs_db_project_name}

Remember, by default, anvi'o will only use '{gene_caller_id}' to format the
deflines of FASTA files it produces.

Now you can use the --defline-format parameter with any combination of these variables to control the output FASTA file deflines. Here are some examples using the contigs-db file included in the Infant Gut Dataset, and the contents of the resulting FASTA file:

anvi-get-sequences-for-gene-calls -c INFANT-GUT-TUTORIAL/CONTIGS.db \
                                  --gene-caller-ids 77 \
                                  -o 77.fa \
                                  --defline-format "{gene_caller_id}"
>77
ATGAGTAATTTATTGCGAGCAGAAGGCGTGTCTTATCAAGTAAATGGTCGTAGCATTCTTTCTGATATTGATTTGTCATTTGAAACAGGCAGCAATACAACAATTGTTGGTCCTTCAGGT
AGCGGGAAAAGTACATTTTTAAAAATTTTATCTTCATTATTAAGTCCTACAGAAGGCGAAATTTTTTATCAAGAAGCGCCAATTACTACAATGCCAATCGAAACATACCGCCAAAAGGTT
TCTTATTGTTTTCAGCAGCCAACTTTATTTGGTGAAACCGTGTATGATAATTTGTTATTTCCATTTACCGTCAGACAAGAAGCGTTTAATCAGGAAAAAGTCGTGGCATTACTCCAACAA
GTGAAATTGCCCGCTGCTTATCTTGAAAAGAAAATAGCCGAACTCTCTGGTGGTGAGCGACAACGGGTTGCTTTGCTACGAAACATTATTTTTGTACCAGATGTTTTATTATTAGACGAA
GTTACAACGGGATTAGATGAAGAAAGCAAACAGATTGTCAATCAATTGTTAAACCAATTAAACAAAGAGCAAGGAGTCACGCTGGTTCGTGTCACGCATGATACCGAAGAAATTCAGCAA
GCACAGCAAGTGATTCGTATTGTAGCAGGAAAGGTGGCGCCGACAGATGGATTTAGCAGTTAA

The simple option shown above also is the default defline format anvi’o uses. Here is a more sophisticated example:

anvi-get-sequences-for-gene-calls -c INFANT-GUT-TUTORIAL/CONTIGS.db \
                                  --gene-caller-ids 77 \
                                  -o 77.fa \
                                  --defline-format "{contigs_db_project_name}_{contig_name}_{gene_caller_id}"
>Infant_Gut_Contigs_from_Sharon_et_al_Day17a_QCcontig1_77
ATGAGTAATTTATTGCGAGCAGAAGGCGTGTCTTATCAAGTAAATGGTCGTAGCATTCTTTCTGATATTGATTTGTCATTTGAAACAGGCAGCAATACAACAATTGTTGGTCCTTCAGGT
AGCGGGAAAAGTACATTTTTAAAAATTTTATCTTCATTATTAAGTCCTACAGAAGGCGAAATTTTTTATCAAGAAGCGCCAATTACTACAATGCCAATCGAAACATACCGCCAAAAGGTT
TCTTATTGTTTTCAGCAGCCAACTTTATTTGGTGAAACCGTGTATGATAATTTGTTATTTCCATTTACCGTCAGACAAGAAGCGTTTAATCAGGAAAAAGTCGTGGCATTACTCCAACAA
GTGAAATTGCCCGCTGCTTATCTTGAAAAGAAAATAGCCGAACTCTCTGGTGGTGAGCGACAACGGGTTGCTTTGCTACGAAACATTATTTTTGTACCAGATGTTTTATTATTAGACGAA
GTTACAACGGGATTAGATGAAGAAAGCAAACAGATTGTCAATCAATTGTTAAACCAATTAAACAAAGAGCAAGGAGTCACGCTGGTTCGTGTCACGCATGATACCGAAGAAATTCAGCAA
GCACAGCAAGTGATTCGTATTGTAGCAGGAAAGGTGGCGCCGACAGATGGATTTAGCAGTTAA

And finally, here is an example that is quite comprehensive in its request from anvi’o:

anvi-get-sequences-for-gene-calls -c INFANT-GUT-TUTORIAL/CONTIGS.db \
                                  --gene-caller-ids 77 \
                                  -o 77.fa \
                                  --defline-format "{gene_caller_id} contig_name:{contig_name}|gene_start:{start}|gene_stop:{stop}|gene_length:{length}|gene_direction:{direction}"
>77 contig_name:Day17a_QCcontig1|gene_start:67276|gene_stop:67939|gene_length:663|gene_direction:f
ATGAGTAATTTATTGCGAGCAGAAGGCGTGTCTTATCAAGTAAATGGTCGTAGCATTCTTTCTGATATTGATTTGTCATTTGAAACAGGCAGCAATACAACAATTGTTGGTCCTTCAGGT
AGCGGGAAAAGTACATTTTTAAAAATTTTATCTTCATTATTAAGTCCTACAGAAGGCGAAATTTTTTATCAAGAAGCGCCAATTACTACAATGCCAATCGAAACATACCGCCAAAAGGTT
TCTTATTGTTTTCAGCAGCCAACTTTATTTGGTGAAACCGTGTATGATAATTTGTTATTTCCATTTACCGTCAGACAAGAAGCGTTTAATCAGGAAAAAGTCGTGGCATTACTCCAACAA
GTGAAATTGCCCGCTGCTTATCTTGAAAAGAAAATAGCCGAACTCTCTGGTGGTGAGCGACAACGGGTTGCTTTGCTACGAAACATTATTTTTGTACCAGATGTTTTATTATTAGACGAA
GTTACAACGGGATTAGATGAAGAAAGCAAACAGATTGTCAATCAATTGTTAAACCAATTAAACAAAGAGCAAGGAGTCACGCTGGTTCGTGTCACGCATGATACCGAAGAAATTCAGCAA
GCACAGCAAGTGATTCGTATTGTAGCAGGAAAGGTGGCGCCGACAGATGGATTTAGCAGTTAA

You also have the option to report the output in gff3 format, report extended deflines for each gene, or report amino acid sequences instead of nucleotide sequences.

Running on a genomes storage database

You can also get the sequences from gene calls in a genomes-storage-db, like so:

anvi-get-sequences-for-gene-calls -g genomes-storage-db \ -o path/to/output

This will create a genes-fasta that contains every gene in your genomes storage database. To focus on only a subset of the genomes contained in your database, use the flag --genome-names. You can provide a comma-delimited list of genome names or a flat text file that contains one genome per line. Alternatively, you could provide a list of gene-caller-ids as specified above.

You also have the option to report the output in gff3 format, report extended deflines for each gene, or report amino acid sequences instead of nucleotide sequences.

Edit this file to update this information.

Additional Resources

Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__ tag in this file to see an example.