Observation
In the benchmark, the pooled dada step uses far fewer effective cores than the
per-sample modes, most starkly on MiSeq:
| platform | mode | dada cores (of 24) |
|---|---|---|
| MiSeq | pooled | 9.9–12.9 (dada_rev / dada_fwd) |
| MiSeq | pseudo | 19.9–20.8 |
| MiSeq | nopool | 19.0–19.9 |
| PacBio | pooled | 18.8 |
| PacBio | pseudo / nopool | 21.0–21.7 |
So pooled is the only mode that under-fills cores, and short reads make it worse.
Why (grounded in the code)
run_dada (src/dada.rs:510) is explicitly Amdahl-structured — its own comment:
"Only b_compare_parallel is multithreaded; shuffle/bud/p_update are serial."
Each loop iteration buds one cluster (serial b_bud), aligns all raws against
centers (parallel b_compare_parallel), then runs b_p_update + shuffle
(serial). Effective cores ≈ (parallel_time / total_time) × threads, so low
utilization implies a high serial fraction.
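
To make the relationship concrete, here is a minimal sketch of the two-phase model, assuming the serial phases keep exactly one core busy and the parallel compare keeps all threads busy. The function names are hypothetical and not part of src/dada.rs. Unlike the approximation above, it also counts the single core that is busy during the serial phases.

```rust
/// Effective cores under a two-phase model (assumption: serial phases use
/// one core, the parallel phase uses `threads` cores). Times are wall-clock
/// seconds.
fn effective_cores(serial_secs: f64, parallel_secs: f64, threads: f64) -> f64 {
    let wall = serial_secs + parallel_secs;
    let cpu = serial_secs + parallel_secs * threads;
    cpu / wall
}

/// Inverse of the model: the share of wall time spent in serial phases
/// implied by an observed effective-core count.
/// From eff = s + (1 - s) * threads, it follows that
/// s = (threads - eff) / (threads - 1).
fn serial_wall_fraction(effective: f64, threads: f64) -> f64 {
    (threads - effective) / (threads - 1.0)
}
```

Under this model, the 9.9 effective cores of 24 seen for MiSeq pooled would correspond to roughly 61% of wall time in serial phases, versus roughly 14% at 20.8 cores. These are model estimates only; the phase timers are what would confirm them.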
Hypothesis for the platform split: the serial phases scale with nraw × nclust,
while the parallel phase scales with alignment cost. Short MiSeq reads (240 bp)
make each alignment cheap, so the serial per-round bookkeeping dominates; long
PacBio reads (1500 bp) make the parallel compare expensive, so it dominates and
cores stay high. Pooling has no across-sample concurrency axis (unlike
pseudo/nopool with --sample-jobs), so it's fully exposed to this.
First step — it's already instrumented
run_dada keeps phase timers t_compare / t_shuffle / t_bud / t_pupdate.
Surface them (verbose or aux) for a MiSeq pooled run to quantify the serial
breakdown and identify which serial phase to attack.
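
A possible shape for the surfaced output, as a sketch only: the PhaseTimers struct and its methods are hypothetical, and the real timers in run_dada may be stored differently. Only the four timer names come from the code.

```rust
use std::time::Duration;

/// Hypothetical container mirroring the four timers kept by run_dada.
#[derive(Default, Debug, Clone, Copy)]
struct PhaseTimers {
    t_compare: Duration,
    t_shuffle: Duration,
    t_bud: Duration,
    t_pupdate: Duration,
}

impl PhaseTimers {
    /// Sum of all four measured phases.
    fn total(&self) -> Duration {
        self.t_compare + self.t_shuffle + self.t_bud + self.t_pupdate
    }

    /// Share of measured time spent outside the parallel compare.
    fn serial_fraction(&self) -> f64 {
        let total = self.total().as_secs_f64();
        if total == 0.0 {
            return 0.0;
        }
        (total - self.t_compare.as_secs_f64()) / total
    }

    /// One-line percentage breakdown suitable for a verbose log.
    fn report(&self) -> String {
        let total = self.total().as_secs_f64();
        let pct = |d: Duration| {
            if total > 0.0 {
                100.0 * d.as_secs_f64() / total
            } else {
                0.0
            }
        };
        format!(
            "compare {:.1}% | shuffle {:.1}% | bud {:.1}% | p_update {:.1}%",
            pct(self.t_compare),
            pct(self.t_shuffle),
            pct(self.t_bud),
            pct(self.t_pupdate)
        )
    }
}
```

Printing report() once per run_dada call would be enough to see which serial phase dominates on a MiSeq pooled run.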
Then
If b_p_update / b_shuffle / b_bud dominate, evaluate parallelizing them —
they're per-raw loops (p-value updates, max-pval search for budding) amenable to
a rayon reduction. A win here would also help the tail of pseudo/nopool, not
just pooled.
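
As an illustration of the reduction pattern, here is a sketch of a parallel max search over per-raw values. It uses std scoped threads rather than rayon so it stays dependency-free; with rayon the same map-then-reduce shape would be written as a parallel iterator. The function name, the flat f64 slice, and the absence of NaN values are assumptions, not the actual b_bud data layout.

```rust
use std::thread;

/// Returns (index, value) of the largest entry, or None for an empty slice.
/// Ties resolve to the lowest index, so the result matches a serial scan.
/// Assumes the slice contains no NaN values.
fn par_argmax(values: &[f64], n_threads: usize) -> Option<(usize, f64)> {
    if values.is_empty() {
        return None;
    }
    let n_threads = n_threads.max(1);
    let chunk_len = (values.len() + n_threads - 1) / n_threads;

    // Map: each thread scans one contiguous chunk and reports its local best.
    let local_best: Vec<(usize, f64)> = thread::scope(|s| {
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .enumerate()
            .map(|(c, chunk)| {
                s.spawn(move || {
                    let mut best = (c * chunk_len, chunk[0]);
                    for (i, &v) in chunk.iter().enumerate().skip(1) {
                        if v > best.1 {
                            best = (c * chunk_len + i, v);
                        }
                    }
                    best
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Reduce: chunks are visited in index order, and the strict `>` keeps
    // the earliest index on ties.
    let mut best = local_best[0];
    for &cand in &local_best[1..] {
        if cand.1 > best.1 {
            best = cand;
        }
    }
    Some(best)
}
```

The deterministic tie-break matters here: if budding depends on which raw wins a tie, a parallel version has to pick the same one as the serial loop to keep results reproducible.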
Priority
Lower than the memory tickets: pooled is the least-recommended mode (pseudo/nopool
are preferred and already utilize cores well). File for visibility; chase if the
profile shows a cheap, high-leverage serial phase.
Noted in docs/results.md (MiSeq per-step section).