Reformat FASTA file (remove contigs based on length, or based on a given list of deflines, and/or generate an output with simpler names).
🔙 To the main page of anvi’o programs and artifacts.
contigs-fasta contig-rename-report-txt
A simpe program to perform a combination of simple operations on a FASTA file including,
N
for nucleotide sequences that are not A, C, T, or G, or to replace any character with X
for amino acid sequences if the character does not match any of the single-letter amino acid characters (useuful to make sure the input file conforms the expectations of that input file type (i.e., all DNA sequences, or all AA sequences, etc)).This program can work with compressed input FASTA files (i.e., the file name ends with a .gz
extention) and will report a compressed output FASTA file (i.e., if the output file name ends with a .gz
extension).
One of the most useful tasks this program performs is to simplify the deflines in your fasta file so they meet the conditions required of a contigs-fasta that is required by other anvi’o programs. You can simplify deflines in a fasta file the following way:
anvi-script-reformat-fasta fasta \ -o contigs-fasta \ --simplify-names \ --report-file contig-rename-report-txt
The --report-file
flag is quite important to use here as it will generate a TAB-delimited file, contig-rename-report-txt, to keep track of which defline in the new file corresponds to which defline in the original file.
If your fasta file includes a lot of very short contigs, removing them may dramatically improve the performance of the generation and processing of your contigs-db. The example below runs the same command while also removing sequences that are shorter than 1,000 nts:
anvi-script-reformat-fasta fasta \ -o contigs-fasta \ -l 1000 \ --simplify-names \ --report-file contig-rename-report-txt
anvi-script-reformat-fasta contigs.fa \
--simplify-names \
--prefix YYY \
--min-len 1000 \
--seq-type NT \
--overwrite-input
Input ........................................: contigs.fa
Output .......................................: (anvi'o will overwrite your input file)
WHAT WAS THERE
===============================================
Total num contigs ............................: 4,189
Total num nucleotides ........................: 35,766,167
WHAT WAS ASKED
===============================================
Simplify deflines? ...........................: Yes
Add prefix to sequence names? ................: Yes, add 'YYY'
Minimum length of contigs to keep ............: 1,000
Max % gaps allowed ...........................: 100.00%
Max num gaps allowed .........................: 1,000,000
Exclude specific sequences? ..................: No
Keep specific sequences? .....................: No
Enforce sequence type? .......................: Yes, enforce 'NT'
WHAT HAPPENED
===============================================
Contigs removed ..............................: 3,156 (75.34% of all)
Nucleotides removed ..........................: 6,121,239 (17.11% of all)
Nucleotides modified .........................: 161 (0.00045% of all)
Deflines simplified ..........................: True
* The contents of your input file have changed because you used the flag
`--overwrite-input`.
Please use the flag --overwrite-input
with extreme caution.
Edit this file to update this information.
Are you aware of resources that may help users better understand the utility of this program? Please feel free to edit this file on GitHub. If you are not sure how to do that, find the __resources__
tag in this file to see an example.