The anvi'o 'sra-download' workflow

Download, extract, and gzip paired-end FASTQ files automatically from the NCBI Sequence Read Archive (SRA)

The sra-download workflow automates downloading paired-end FASTQ files for a given list of SRA accessions using NCBI's sra-tools, then gzips them using pigz.


Authors

Matthew Schechter

Artifacts accepted

The sra-download workflow can typically be initiated with the following artifacts:

workflow-config

Artifacts produced

The sra-download workflow typically produces the following anvi’o artifacts:

paired-end-fastq

Third party programs

This is a list of programs that may be used by the sra-download workflow, depending on the user settings in the workflow-config:

prefetch (Downloads SRA accessions)
fasterq-dump (Extracts FASTQ files from SRA accessions)
pigz (Compresses FASTQ files in parallel)

An anvi’o installation that follows the recommendations on the installation page will include all of these programs. Please consider which of them your settings actually use, and cite those tools in your methods section.

Workflow description and usage

The sra-download workflow is a Snakemake workflow that downloads FASTQ files for a list of SRA accessions using NCBI's sra-tools, gzips them using pigz, and provides a samples-txt. You will need to have these tools installed before you start.

Let’s get started.

Required input

Configuration file

The first step is to make a workflow-config.

anvi-run-workflow -w sra-download --get-default-config sra_download_config.json

Here’s what the workflow-config file looks like:

$ cat sra_download_config.json
{
    "SRA_accession_list": "SRA_accession_list.txt",
    "prefetch": {
        "--max-size": "40g",
        "threads": 2
    },
    "fasterq_dump": {
        "threads": 6
    },
    "pigz": {
        "threads": 8,
        "--processes": ""
    },
    "output_dirs": {
        "SRA_prefetch": "01_NCBI_SRA",
        "FASTAS": "02_FASTA",
        "LOGS_DIR": "00_LOGS"
    },
    "max_threads": "",
    "config_version": "3",
    "workflow_name": "sra-download"
}

Modify any of the bells and whistles in the config file

If this is your first time using an anvi’o Snakemake workflow, I recommend reading Alon’s blog post first.

Feel free to adjust anything in the config file! Here are some to consider:

threads: this can be tuned for each step depending on the size and number of SRA accessions you are downloading.
prefetch --max-size: I already raised this limit above the prefetch default to 40g, but maybe you need more! For reference, I can download TARA Ocean metagenomes with the current parameter. You can use vdb-dump --info to learn how much the prefetch step will download, e.g. vdb-dump SRR000001 --info. The sra-tools documentation has more on this.
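
To check every accession before picking a --max-size value, the vdb-dump call above can be looped over the accession list. This is a minimal sketch, not part of the workflow itself: report_sizes is a hypothetical helper name, and the list file name is taken from the default config.

```shell
# Hypothetical helper: print the vdb-dump summary for every accession
# listed (one per line) in the file given as the first argument.
report_sizes() {
    list_file=$1
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines so they are not passed to vdb-dump.
        if [ -z "$acc" ]; then
            continue
        fi
        echo "== $acc =="
        # Redirect stdin so the tool cannot consume the rest of the list.
        vdb-dump "$acc" --info < /dev/null
    done < "$list_file"
}

# Only query NCBI when sra-tools is actually installed.
if command -v vdb-dump >/dev/null 2>&1; then
    report_sizes SRA_accession_list.txt
else
    echo "vdb-dump not found; install sra-tools first" >&2
fi
```

If any accession reports a size above the configured limit, raise "--max-size" in the prefetch section of the config before starting the workflow.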

List of SRA accessions

The input for the sra-download workflow is SRA_accession_list.txt. This file lists the SRA accessions you would like to download, one per line, and it looks like this:

$ cat SRA_accession_list.txt
ERR6450080
ERR6450081
SRR5965623
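
A stray space or typo in this file is easy to miss, so it can help to sanity-check it before launching anything. The sketch below writes the example list and flags lines that do not look like run accessions. The pattern (SRR, ERR, or DRR followed by digits) is my assumption about what a run accession looks like, not a rule enforced by the workflow.

```shell
# Write the accession list, one accession per line (values from the example above).
cat > SRA_accession_list.txt <<'EOF'
ERR6450080
ERR6450081
SRR5965623
EOF

# Assumed pattern for run accessions: SRR/ERR/DRR followed by digits only.
# Collect every line (with its line number) that does NOT match.
bad_lines=$(grep -Evn '^[SED]RR[0-9]+$' SRA_accession_list.txt || true)

if [ -n "$bad_lines" ]; then
    echo "Unexpected lines in SRA_accession_list.txt:" >&2
    echo "$bad_lines" >&2
else
    echo "All $(wc -l < SRA_accession_list.txt | tr -d ' ') accessions look well-formed."
fi
```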

The .sra files are stored in 01_NCBI_SRA/. This directory will be deleted upon successful completion of the workflow because I don’t know of any use for .sra files. If you need them, feel free to update the workflow.

Start the workflow!

Here’s a basic command to start the workflow:

Run on your local computer

anvi-run-workflow -w sra-download -c sra_download_config.json
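
Before committing to a long download, it can be worth letting Snakemake plan the jobs first. The sketch below assumes the --dry-run flag of anvi-run-workflow and only starts the real run if the dry run succeeds. The guard around the command is my addition so the snippet fails gently outside an anvi’o environment.

```shell
config=sra_download_config.json

if command -v anvi-run-workflow >/dev/null 2>&1; then
    # Plan the jobs without executing them, then launch the real run
    # only if the dry run exits successfully.
    anvi-run-workflow -w sra-download -c "$config" --dry-run &&
        anvi-run-workflow -w sra-download -c "$config"
else
    echo "anvi-run-workflow not found; activate your anvi'o environment first" >&2
fi
```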

Go big and use an HPC!

The power of Snakemake shines when you can leverage a High Performance Computing system to parallelize jobs. Check out the Snakemake cluster documentation on how to launch this workflow on your own HPC.
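
As a rough illustration only, here is what a cluster launch might look like. It assumes a SLURM scheduler (sbatch) and a Snakemake version that still accepts the --cluster option, since newer Snakemake releases use executor plugins instead. The sbatch options and the job count are placeholders to adapt to your own system.

```shell
# Placeholder: maximum number of jobs Snakemake may have submitted at once.
jobs=10

if command -v anvi-run-workflow >/dev/null 2>&1 && command -v sbatch >/dev/null 2>&1; then
    # Everything after --additional-params is handed to Snakemake.
    # {threads} is filled in by Snakemake for each rule.
    anvi-run-workflow -w sra-download \
        -c sra_download_config.json \
        --additional-params \
            --cluster 'sbatch --cpus-per-task={threads}' \
            --jobs "$jobs"
else
    echo "anvi-run-workflow and/or sbatch not found; nothing submitted" >&2
fi
```

The thread counts in the config file determine what each submitted job requests, so it is worth matching them to the node sizes available on your cluster.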
