Download, extract, and gzip paired-end FASTQ files automatically from the NCBI Sequence Read Archive (SRA)
The sra-download workflow automates downloading paired-end FASTQ files for a given list of SRA accessions using NCBI sra-tools, then gzips them using pigz.
The sra-download workflow can typically be initiated with the following artifacts:
The sra-download workflow typically produces the following anvi’o artifacts:
This is a list of programs that may be used by the sra-download workflow, depending on the user settings in the workflow-config:
An anvi’o installation that follows the recommendations on the installation page will include all these programs. But please consider your settings, and cite these additional tools in your methods section.
The sra_download workflow is a Snakemake workflow that downloads FASTQ files for SRA accessions (e.g. SRR000001 and ERR000001) using NCBI sra-tools, gzips them using pigz, and provides a samples-txt. You will need to have these tools installed before you start.
Let’s get started.
The first step is to make a workflow-config.
anvi-run-workflow -w sra_download --get-default-config sra_download_config.json
Here’s what the workflow-config file looks like:
$ cat sra_download_config.json
{
    "SRA_accession_list": "SRA_accession_list.txt",
    "prefetch": {
        "--max-size": "40g",
        "threads": 2
    },
    "fasterq_dump": {
        "threads": 6
    },
    "pigz": {
        "threads": 8,
        "--processes": ""
    },
    "output_dirs": {
        "SRA_prefetch": "01_NCBI_SRA",
        "FASTAS": "02_FASTA",
        "LOGS_DIR": "00_LOGS"
    },
    "max_threads": "",
    "config_version": "3",
    "workflow_name": "sra_download"
}
If this is the first time using an anvi’o Snakemake workflow, check out Alon’s blog post first.
Feel free to adjust anything in the config file! Here are some settings to consider:
threads: this can be optimized for any of the steps, depending on the size and number of SRA accessions you are downloading.
prefetch --max-size: the default is 40g, but you may need more. For reference, this --max-size is large enough to download TARA Ocean metagenomes. You can use vdb-dump --info to learn how much the prefetch step will download, e.g. vdb-dump SRR000001 --info. Read more about that here.
The input for the sra_download workflow is SRA_accession_list.txt. It contains the SRA accessions you would like to download, one per line. Accessions begin with the prefix SRR or ERR, denoting uploads to NCBI or EBI respectively. The file looks like this:
$ cat SRA_accession_list.txt
ERR6450080
ERR6450081
SRR5965623
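Because a single mistyped line in the list will only surface once the download starts, a quick check of the file beforehand can help. The following is a minimal sketch, not part of anvi’o: the function name check_accessions is hypothetical, and accepting the DRR prefix (used by DDBJ) alongside SRR and ERR is an assumption that goes beyond what this page describes.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of anvi'o): report lines in an
# accession list that do not look like SRR/ERR/DRR run accessions.
# Accepting DRR (DDBJ) is an assumption; this page only mentions SRR and ERR.
check_accessions() {
    awk '
        BEGIN { bad = 0 }
        /^[ \t]*$/ { next }                  # ignore blank lines
        !/^(SRR|ERR|DRR)[0-9]+$/ {           # anything else is suspicious
            print "line " NR ": " $0
            bad = 1
        }
        END { exit bad }                     # non-zero if any line was flagged
    ' "$1"
}

# Example: check_accessions SRA_accession_list.txt && echo "list looks fine"
```

The function prints each offending line with its line number and returns a non-zero status, so it can be chained in front of the workflow command.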
The .sra files are stored in 01_NCBI_SRA/. This directory will be deleted upon successful completion of the workflow because I don’t know of any use for the .sra files afterwards. If you need them, feel free to update the workflow.
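Since the .sra files land on disk before they are converted, it can be worth checking how much each accession will download and whether the prefetch --max-size setting covers it. Here is a minimal sketch that loops the vdb-dump --info call mentioned earlier over the accession list. The function name sra_info and the VDB_DUMP override variable are hypothetical, added here only so the loop can be dry-run without sra-tools installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of anvi'o): run `vdb-dump <accession> --info`
# for every accession in a list file, one accession per line.
# VDB_DUMP is an assumed override so the loop can be tested with another command.
sra_info() {
    local list="$1" acc
    # The `|| [ -n "$acc" ]` part also handles a last line without a newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue            # skip blank lines
        echo "== $acc"
        "${VDB_DUMP:-vdb-dump}" "$acc" --info
    done < "$list"
}

# Example: sra_info SRA_accession_list.txt
```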
Here’s a basic command to start the workflow:
anvi-run-workflow -w sra_download -c sra_download_config.json
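A hand-edited config with a stray comma or a missing brace will stop the workflow at startup, so a syntax check before running the command above can save a failed launch. The sketch below is not part of anvi’o: the function name check_config is hypothetical, and it assumes python3 is on your PATH. It only checks that the file is valid JSON, not that the settings make sense to the workflow.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of anvi'o): verify that a config file is
# syntactically valid JSON using Python's standard json.tool module.
# Assumes python3 is available on PATH.
check_config() {
    if python3 -m json.tool "$1" > /dev/null; then
        echo "OK: $1 is valid JSON"
    else
        echo "ERROR: $1 is not valid JSON" >&2
        return 1
    fi
}

# Example:
#   check_config sra_download_config.json && \
#       anvi-run-workflow -w sra_download -c sra_download_config.json
```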
The power of Snakemake shines when you can leverage a High Performance Computing system to parallelize jobs. Check out the Snakemake cluster documentation on how to launch this workflow on your own HPC.
Here is how to use the sra_download workflow to download all of the sequencing files from an NCBI BioSample:
1. Search for the BioSample under All Databases on the NCBI website.
2. Under Genomes, click SRA.
3. Click Send to: and then Run Selector.
4. Click Metadata or Accession list to download a text file with ALL of the SRA accessions associated with the BioSample.
Put the SRA accessions into SRA_accession_list.txt and start the workflow!
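If you downloaded the Metadata file rather than the Accession list, the accessions still need to be pulled out of the table. The following is a minimal sketch under stated assumptions: it supposes the metadata file is comma-separated with a header row whose first column is named Run, which you should confirm against your own download. The function name runs_from_metadata and the file name SraRunTable.txt in the example are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of anvi'o): print the run accessions from a
# Run Selector metadata table. Assumes a comma-separated file whose header
# row starts with a column named "Run"; adjust if your download differs.
runs_from_metadata() {
    awk -F',' '
        NR == 1 {
            if ($1 != "Run") {
                print "expected first column to be Run, found: " $1 > "/dev/stderr"
                exit 1
            }
            next                             # skip the header row
        }
        $1 != "" { print $1 }                # one accession per output line
    ' "$1"
}

# Example: runs_from_metadata SraRunTable.txt > SRA_accession_list.txt
```

The function refuses to run when the header does not match, which avoids silently writing the wrong column into the accession list.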