Skip to content

HPCBio/dada2-rs

Repository files navigation

dada2-rs

📖 Documentation: dada2-rs.readthedocs.io — installation, Illumina/PacBio walkthroughs, and the performance/benchmarking reference.

An experimental implementation of DADA2 in Rust, using Claude Code (specically Sonnet 4.6 and Opus 4.6/4.7/4.8) for the bulk of the work.

Implementations

Rust ports of:

Step DADA2 (R) dada2-rs
Filter/trimming FASTQ filterAndTrim filter-and-trim
Filter/trimming FASTQ (PacBio) removePrimers,filterAndTrim remove-primers (one step)
Dereplication derepFastq derep
Error models learnErrors learn-errors
Denoising dada dada
Merging mergePairs merge-pairs
Chimera removal removeBimeraDenovo remove-bimera-denovo
RDP taxonomic classifier assignTaxonomy + assignSpecies assign-taxonomy + assign-species
Merging sequence tables mergeSequenceTables make-sequence-table (accepts multiple inputs)
Making sequence tables from multiple inputs makeSequenceTable make-sequence-table
  • Current error models:
    • loessErrfun
    • PacBioErrfun
    • noqualErrfun
    • makeBinnedErrfun
    • Custom error models in R and/or Python - Experimental, tested
  • Other functionality:
    • Helper functions to convert sequence tables to TSV or FASTA (seq-table-to-fasta, seq-table-to-tsv)
    • Helper function to convert taxonomic output to TSV (tax-to-tsv)
    • Basic scripts and examples for comparing results between runs to trace differences
    • Experimental sub-sampling function for input FASTQ (sample) and related error model function (errors-from-sample) that mirrors learn-errors (can be used for bootstrapping)
    • Intermediate outputs (in JSON) - can be evaluated for debugging purposes or for plotting in R, Python, etc.
    • Dedicated Docker builds available
  • In progress
    • Summary FASTQ metrics and plots

Building

Requires a recent stable Rust toolchain (rustup.rs).

cargo build --release
# binary at target/release/dada2-rs

For the native (-C target-cpu=native) build and Docker, see Installation.

Subcommands

Subcommand Description
filter-and-trim Filter and trim FASTQ reads (mirrors filterAndTrim)
derep Dereplicate a FASTQ file
sample Dereplicate and subsample FASTQ files, one JSON per sample
errors-from-sample Learn error model from derep JSON files
learn-errors Learn error model directly from FASTQ files
dada Denoise a sample using a learned error model
merge-pairs Merge denoised forward and reverse reads
make-sequence-table Build a sample × sequence count table
remove-bimera-denovo Remove chimeric sequences
summary Per-position quality metrics from a FASTQ

Run dada2-rs <subcommand> --help for full parameter documentation.

Usage & walkthroughs

End-to-end examples live in the documentation:

Run dada2-rs <subcommand> --help for the full parameter set of any step.

Benchmarks & comparison with R DADA2

Performance is benchmarked head-to-head against R DADA2 with the harness in comparison/benchmark/. See the docs for the tooling, metrics, and results:

AI Assistance Disclosure

See the project overview for the project's origins and goals, which are directly relevant here.

This tool was written with the assistance of AI coding agents, specifically Claude Code, using Sonnet and Opus LLMs. All commits using AI assistance are openly noted.

Correctness is validated by comparing output against DADA2 v1.36 on a suite of real sequencing datasets - not by manual code review alone. AI generated the implementation; humans defined the validation criteria, made some key coding updates, and verified results.

Citation

dada2-rs is a reimplementation; if you use it, please cite the original work that describes the algorithm (see CITATION.cff):

  • Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods. 2016;13:581-583. doi:10.1038/nmeth.3869
  • Rosen MJ, Callahan BJ, Fisher DS, Holmes SP. Denoising PCR-amplified metagenome data. BMC Bioinformatics. 2012;13:283. doi:10.1186/1471-2105-13-283