Generate synthetic sequencing reads (Illumina, PacBio HiFi, ONT) from reference FASTA files with optional SNV injection at controlled multi-allele frequencies.
paired-end-fastq
single-end-fastq
This program generates synthetic sequencing reads from a reference fasta file. It supports Illumina paired-end and single-end short reads, PacBio HiFi, and Oxford Nanopore long reads, with optional SNV injection at controlled multi-allele frequencies.
The key feature is multi-allele SNV injection: you can specify exactly which positions should be variable and what the base frequencies should be at each position. This is critical for benchmarking tools that rely on SNV patterns where real biological variability involves 3-4 different bases at a position rather than just 2.
The easiest way to use this program is with a preset. Presets set sensible defaults for read length, insert size, error rate, and quality scores:
anvi-script-gen-reads -f fasta \
                      -o OUTPUT_PREFIX \
                      --preset illumina-paired
This generates OUTPUT_PREFIX-R1.fastq and OUTPUT_PREFIX-R2.fastq with 150 bp paired-end reads at 50X coverage.
Available presets:
Short reads (Illumina):
| Preset | Read type | Length | Insert size | Error rate | Quality |
|---|---|---|---|---|---|
| illumina-paired | paired-end | 150 bp | 450 bp (std 50) | 0.5% | ? (Q30) |
| illumina-single | single-end | 150 bp | - | 0.5% | ? (Q30) |
Long reads (PacBio):
| Preset | Read type | Length | Distribution | Error rate | Quality | Notes |
|---|---|---|---|---|---|---|
| pacbio-hifi | long-distributed | 15 kb (std 3.5 kb) | normal | 0.1% | F (Q37) | Modern HiFi circular consensus sequencing |
| pacbio-clr | long-distributed | 15 kb (std 8 kb) | normal | 12% | . (Q13) | Legacy CLR (pre-HiFi), noisy but long |
Long reads (Oxford Nanopore):
| Preset | Read type | Length | Distribution | Error rate | Quality | Notes |
|---|---|---|---|---|---|---|
| ont-r9 | long-distributed | 5 kb (std 4 kb) | lognormal | 6% | 3 (Q18) | Legacy R9.4.1 chemistry |
| ont-r10 | long-distributed | 8 kb (std 5 kb) | lognormal | 1% | = (Q28) | Modern R10.4.1 with super-accuracy basecalling |
| ont-ultralong | long-distributed | 50 kb (std 40 kb) | lognormal | 2% | : (Q25) | PromethION ultralong runs, very high variance |
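The Quality column in the preset tables pairs a single character with a Q score. The pairings are consistent with the standard Phred+33 FASTQ encoding, in which the score is the character's ASCII code minus 33 and the implied per-base error probability is 10^(-Q/10). The following is a minimal sketch of that decoding; the function names are illustrative and are not part of anvi'o.

```python
def phred_from_char(char, offset=33):
    """Decode one FASTQ quality character into a Phred score (Phred+33 assumed)."""
    return ord(char) - offset


def error_probability(q):
    """Per-base error probability implied by a Phred score."""
    return 10 ** (-q / 10)


# Quality characters copied from the preset tables.
PRESET_QUALITY_CHARS = {
    "illumina-paired": "?",
    "illumina-single": "?",
    "pacbio-hifi": "F",
    "pacbio-clr": ".",
    "ont-r9": "3",
    "ont-r10": "=",
    "ont-ultralong": ":",
}

for preset, char in PRESET_QUALITY_CHARS.items():
    q = phred_from_char(char)
    print(f"{preset:16s} {char!r} Q{q} p_error={error_probability(q):.5f}")
```

The tables list the quality character and the error rate as separate columns, and the two need not agree: Q30 corresponds to an error probability of 0.1%, while the Illumina presets list an error rate of 0.5%.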
ONT presets use a lognormal length distribution, which produces the right-skewed shape typical of nanopore data (mode lower than mean, with a long tail of very long reads). PacBio presets use a normal distribution, which better reflects the tighter length control of SMRT sequencing.
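To make the "mode lower than mean" point concrete, the sketch below converts a target mean and standard deviation into lognormal parameters and compares the mode with the mean. It assumes that the length and std values in the preset tables describe the read lengths themselves (not their logarithms); the source does not state how the program parameterizes the distribution, so treat this as an illustration rather than the program's implementation.

```python
import math
import random


def lognormal_params(mean, std):
    """Return (mu, sigma) of the underlying normal for a lognormal
    distribution with the given mean and standard deviation."""
    sigma_sq = math.log(1 + (std / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2
    return mu, math.sqrt(sigma_sq)


def lognormal_mode(mu, sigma):
    """Most frequent value of a lognormal distribution."""
    return math.exp(mu - sigma ** 2)


# Numbers from the ont-r10 row: 8 kb with std 5 kb (interpretation assumed).
mu, sigma = lognormal_params(8000, 5000)

rng = random.Random(42)
lengths = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]
sample_mean = sum(lengths) / len(lengths)

print(f"mode: {lognormal_mode(mu, sigma):.0f} bp")
print(f"sample mean: {sample_mean:.0f} bp")
```

Under this parameterization the mode falls well below the mean, which is the right-skewed shape described above.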
You can override any preset parameter individually. For example, to use the Illumina paired-end preset but with 100X coverage and 250 bp reads:
anvi-script-gen-reads -f fasta \
                      -o OUTPUT_PREFIX \
                      --preset illumina-paired \
                      --coverage 100 \
                      --read-length 250
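Coverage and read length together determine how many reads a run produces. The usual relation is coverage = (number of reads × read length) / reference length. The sketch below applies it to estimate output size before a run. How the program rounds the count or distributes reads across contigs is not stated in this document, so the function and the 5 Mb reference are illustrative assumptions.

```python
import math


def reads_needed(reference_length, coverage, read_length, paired=False):
    """Number of reads (or read pairs, if paired) for a target mean coverage."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(coverage * reference_length / bases_per_fragment)


# Hypothetical 5 Mb reference at 100X with 250 bp paired-end reads.
pairs = reads_needed(5_000_000, coverage=100, read_length=250, paired=True)
print(pairs)  # 1000000 read pairs
```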
To inject SNVs at specific positions with controlled allele frequencies, provide a TAB-delimited mutations file:
anvi-script-gen-reads -f fasta \
                      -o OUTPUT_PREFIX \
                      --preset illumina-paired \
                      --mutations-file mutations.tsv
The mutations file must have the following columns:
contig_name position freq_A freq_T freq_C freq_G
contig_1 1000 0.25 0.25 0.25 0.25
contig_1 2000 0.0 0.4 0.3 0.3
contig_2 500 0.5 0.5 0.0 0.0
Positions are 0-indexed and frequencies must sum to 1.0 for each row. Each frequency represents the probability that a read covering that position will carry that base. For example, a position with freq_A=0.25, freq_T=0.25, freq_C=0.25, freq_G=0.25 will have all four bases represented equally across reads. This is the kind of multi-allele variability seen in the variable regions of real diversity-generating retroelements (DGRs).
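A mutations file can be written and checked programmatically before it is passed to the program. The sketch below assumes the file starts with the header line shown above and uses a small tolerance when checking that frequencies sum to 1.0. The helper names and the tolerance are illustrative assumptions, not part of anvi'o.

```python
import csv
import math

COLUMNS = ["contig_name", "position", "freq_A", "freq_T", "freq_C", "freq_G"]


def validate_row(row, tolerance=1e-6):
    """Raise ValueError if a row breaks the rules described above."""
    if int(row["position"]) < 0:
        raise ValueError(f"positions are 0-indexed and cannot be negative: {row}")
    freqs = [float(row[name]) for name in COLUMNS[2:]]
    if any(f < 0.0 or f > 1.0 for f in freqs):
        raise ValueError(f"frequencies must lie between 0 and 1: {row}")
    if not math.isclose(sum(freqs), 1.0, abs_tol=tolerance):
        raise ValueError(f"frequencies must sum to 1.0: {row}")


def write_mutations_file(path, rows):
    """Validate each row, then write a TAB-delimited file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in rows:
            validate_row(row)
            writer.writerow(row)


# The three example rows from the text.
EXAMPLE_ROWS = [
    {"contig_name": "contig_1", "position": 1000,
     "freq_A": 0.25, "freq_T": 0.25, "freq_C": 0.25, "freq_G": 0.25},
    {"contig_name": "contig_1", "position": 2000,
     "freq_A": 0.0, "freq_T": 0.4, "freq_C": 0.3, "freq_G": 0.3},
    {"contig_name": "contig_2", "position": 500,
     "freq_A": 0.5, "freq_T": 0.5, "freq_C": 0.0, "freq_G": 0.0},
]

for example in EXAMPLE_ROWS:
    validate_row(example)
```

Calling `write_mutations_file("mutations.tsv", EXAMPLE_ROWS)` would then produce a file with the same content as the example above.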
If you don’t need precise control over SNV positions, you can have the program randomly place SNVs at a given density:
anvi-script-gen-reads -f fasta \
                      -o OUTPUT_PREFIX \
                      --preset illumina-paired \
                      --snv-density 0.01 \
                      --num-alleles 3
This places SNVs at approximately 1% of positions (1 per 100 bp), each with 3 different alleles at random frequencies. --num-alleles can be 2, 3, or 4.
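To picture what a density of 0.01 with 3 alleles means, the sketch below draws random positions and random allele frequencies for a toy sequence and returns rows in the mutations file layout. It is not the program's algorithm: the choice to always keep the reference base among the alleles, the way frequencies are drawn, and the function name are all assumptions made for illustration.

```python
import random

BASES = "ATCG"


def random_snvs(contig_name, sequence, density, num_alleles, seed=42):
    """Pick round(len * density) positions and give each one
    num_alleles bases with random frequencies that sum to 1."""
    if num_alleles not in (2, 3, 4):
        raise ValueError("num_alleles must be 2, 3, or 4")
    sequence = sequence.upper()
    if set(sequence) - set(BASES):
        raise ValueError("this sketch assumes an A/T/C/G-only sequence")

    rng = random.Random(seed)
    n_sites = round(len(sequence) * density)
    rows = []
    for position in sorted(rng.sample(range(len(sequence)), n_sites)):
        ref = sequence[position]
        # Assumption: the reference base is always one of the alleles.
        others = [base for base in BASES if base != ref]
        alleles = [ref] + rng.sample(others, num_alleles - 1)
        # Small floor keeps every chosen allele at a non-zero frequency.
        weights = [rng.random() + 0.01 for _ in alleles]
        total = sum(weights)
        row = {"contig_name": contig_name, "position": position}
        for base in BASES:
            row[f"freq_{base}"] = 0.0
        for base, weight in zip(alleles, weights):
            row[f"freq_{base}"] = weight / total
        rows.append(row)
    return rows


toy_rows = random_snvs("contig_1", "ACGT" * 250, density=0.01, num_alleles=3)
print(len(toy_rows), "variable positions in a 1,000 bp toy contig")
```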
You can specify all parameters manually instead of using a preset:
anvi-script-gen-reads -f fasta \
                      -o OUTPUT_PREFIX \
                      --read-type paired-end \
                      --read-length 150 \
                      --insert-size 300 \
                      --insert-size-std 50 \
                      --coverage 100 \
                      --error-rate 0.005
Available read types are paired-end, single-end, long-fixed, and long-distributed.
All runs are deterministic by default (seed = 42). To get a different random realization, change the seed:
anvi-script-gen-reads -f fasta \
                      -o OUTPUT_PREFIX \
                      --preset illumina-paired \
                      --seed 123
Generate reads, then run the standard anvi’o pipeline:
anvi-script-gen-reads -f reference.fa \
                      -o sample_01 \
                      --preset illumina-paired \
                      --coverage 100 \
                      --mutations-file mutations.tsv
anvi-gen-contigs-database -f reference.fa \
                          -o contigs.db
bowtie2-build reference.fa reference
bowtie2 -x reference \
        -1 sample_01-R1.fastq \
        -2 sample_01-R2.fastq \
        -S sample_01.sam
samtools view -bS sample_01.sam | samtools sort -o sample_01-raw.bam
anvi-init-bam sample_01-raw.bam -o sample_01.bam
anvi-profile -i sample_01.bam \
             -c contigs.db \
             -o sample_01_profile