perf: pooled dada under-utilizes cores (serial bud/shuffle/p_update) — profile & parallelize

## Observation

In the benchmark, the pooled `dada` step uses far fewer effective cores than the
per-sample modes, most starkly on MiSeq:

| platform | mode | dada cores (of 24) |
|---|---|---:|
| MiSeq | pooled | **9.9–12.9** (dada_rev / dada_fwd) |
| MiSeq | pseudo | 19.9–20.8 |
| MiSeq | nopool | 19.0–19.9 |
| PacBio | pooled | 18.8 |
| PacBio | pseudo / nopool | 21.0–21.7 |

So pooled is the only mode that leaves many cores idle: roughly half of the 24 on
MiSeq, versus a modest dip on PacBio. Short reads make it worse.

## Why (grounded in the code)

`run_dada` (src/dada.rs:510) is explicitly Amdahl-structured, as its own comment says:
*"Only `b_compare_parallel` is multithreaded; shuffle/bud/p_update are serial."*
Each loop iteration buds one cluster (serial `b_bud`), aligns all raws against
centers (**parallel** `b_compare_parallel`), then runs `b_p_update` and the shuffle
(**serial**). If the parallel phase keeps every thread busy, effective cores ≈
(serial_time + parallel_time × threads) / total_time in wall-clock terms, so low
utilization means a high serial fraction.
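
To put rough numbers on the serial fraction, the two-phase model can be inverted for the table's values. This is a back-of-the-envelope sketch, not a measurement: `serial_wall_fraction` is a hypothetical helper, and it assumes the parallel phase scales perfectly across all 24 threads while the serial phases use exactly one core.

```rust
/// Fraction of wall-clock time spent in serial phases, inferred from the
/// observed effective core count under a two-phase model.
///
/// Assumption: serial phases use 1 core, the parallel phase uses all `threads`.
/// With s = serial share of wall time:
///   effective = s * 1 + (1 - s) * threads
///   =>  s = (threads - effective) / (threads - 1)
fn serial_wall_fraction(effective_cores: f64, threads: f64) -> f64 {
    (threads - effective_cores) / (threads - 1.0)
}

fn main() {
    let threads = 24.0;
    // Effective-core values taken from the table above.
    let rows = [
        ("MiSeq pooled (9.9)", 9.9),
        ("MiSeq pooled (12.9)", 12.9),
        ("PacBio pooled (18.8)", 18.8),
        ("MiSeq pseudo (20.8)", 20.8),
    ];
    for (label, cores) in rows {
        let s = serial_wall_fraction(cores, threads);
        println!("{label}: ~{:.0}% of wall time serial", 100.0 * s);
    }
}
```

Under these assumptions the MiSeq pooled runs would spend roughly 48–61% of wall time in serial code, against about 23% for PacBio pooled. Imperfect scaling inside `b_compare_parallel` would lower the true serial share, so treat these as upper bounds until the phase timers are read.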

Hypothesis for the platform split: the serial phases scale with `nraw × nclust`,
while the parallel phase scales with alignment cost. Short MiSeq reads (240 bp)
make each alignment cheap, so the serial per-round bookkeeping dominates; long
PacBio reads (1500 bp) make the parallel compare expensive, so it dominates and
cores stay high. Pooling has no across-sample concurrency axis (unlike
pseudo/nopool with `--sample-jobs`), so it's fully exposed to this.

## First step — it's already instrumented

`run_dada` keeps phase timers `t_compare / t_shuffle / t_bud / t_pupdate`.
Surface them (verbose or aux) for a MiSeq pooled run to quantify the serial
breakdown and identify which serial phase to attack.
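
One possible shape for surfacing the timers, as a sketch only. The field names follow the timers named above, but the `PhaseTimers` struct, the `timed` helper, and the report format are all hypothetical and do not reflect how `run_dada` stores them.

```rust
use std::time::{Duration, Instant};

/// Hypothetical container for the phase timers named in `run_dada`.
#[derive(Default, Debug, Clone, Copy)]
struct PhaseTimers {
    t_compare: Duration,
    t_shuffle: Duration,
    t_bud: Duration,
    t_pupdate: Duration,
}

impl PhaseTimers {
    fn total(&self) -> Duration {
        self.t_compare + self.t_shuffle + self.t_bud + self.t_pupdate
    }

    /// Share of measured wall time spent outside the parallel compare phase.
    fn serial_fraction(&self) -> f64 {
        let total = self.total().as_secs_f64();
        if total == 0.0 {
            return 0.0;
        }
        1.0 - self.t_compare.as_secs_f64() / total
    }

    /// One-line per-phase breakdown suitable for a verbose log.
    fn report(&self) -> String {
        let total = self.total().as_secs_f64().max(f64::MIN_POSITIVE);
        let pct = |d: Duration| 100.0 * d.as_secs_f64() / total;
        format!(
            "compare {:.1}% | shuffle {:.1}% | bud {:.1}% | p_update {:.1}% | total {:.3}s",
            pct(self.t_compare),
            pct(self.t_shuffle),
            pct(self.t_bud),
            pct(self.t_pupdate),
            self.total().as_secs_f64()
        )
    }
}

/// Run `f` and add its wall-clock duration to `slot`.
fn timed<T>(slot: &mut Duration, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    *slot += start.elapsed();
    out
}

fn main() {
    let mut timers = PhaseTimers::default();
    // Stand-in workloads; the real phases would be wrapped the same way.
    let sum: u64 = timed(&mut timers.t_compare, || (0..1_000_000u64).sum());
    timed(&mut timers.t_bud, || std::hint::black_box(sum));
    eprintln!("[dada phases] {}", timers.report());
    eprintln!("serial fraction: {:.2}", timers.serial_fraction());
}
```

Reporting percentages next to the total makes it easy to compare a MiSeq pooled run against a PacBio one at a glance, and `serial_fraction` gives a single number to check against the effective-core figures.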

## Then

If `b_p_update` / `b_shuffle` / `b_bud` dominate, evaluate parallelizing them —
they're per-raw loops (p-value updates, max-pval search for budding) amenable to
a rayon reduction. A win here would also help the *tail* of pseudo/nopool, not
just pooled.
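
As an illustration of the kind of reduction meant here, the sketch below parallelizes a max-p-value search with scoped threads from the standard library, so it runs without extra crates. With rayon the same chunk-then-combine pattern would be more concise and would reuse a thread pool. The function names and the tie-breaking rule (lowest index wins) are assumptions, not taken from `b_bud`, and the p-values are assumed to contain no NaN.

```rust
use std::thread;

/// Serial reference: index and value of the largest p-value.
/// Ties resolve to the lowest index because of the strict `>`.
fn argmax_serial(pvals: &[f64]) -> Option<(usize, f64)> {
    let mut best: Option<(usize, f64)> = None;
    for (i, &p) in pvals.iter().enumerate() {
        if best.map_or(true, |(_, bp)| p > bp) {
            best = Some((i, p));
        }
    }
    best
}

/// Chunked parallel version: each worker reduces one contiguous chunk,
/// then the partial results are combined in chunk order.
fn argmax_parallel(pvals: &[f64], threads: usize) -> Option<(usize, f64)> {
    if pvals.is_empty() {
        return None;
    }
    let threads = threads.max(1);
    let chunk = (pvals.len() + threads - 1) / threads; // ceiling division
    let partials: Vec<Option<(usize, f64)>> = thread::scope(|s| {
        let handles: Vec<_> = pvals
            .chunks(chunk)
            .enumerate()
            .map(|(c, slice)| {
                // Translate the chunk-local index back to a global index.
                s.spawn(move || argmax_serial(slice).map(|(i, p)| (c * chunk + i, p)))
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    // Combining in chunk order with a strict `>` keeps the lowest index on
    // ties, so the result matches the serial loop.
    partials
        .into_iter()
        .flatten()
        .fold(None, |best, (i, p)| match best {
            Some((_, bp)) if !(p > bp) => best,
            _ => Some((i, p)),
        })
}

fn main() {
    let pvals = vec![0.10, 0.80, 0.30, 0.80, 0.05];
    assert_eq!(argmax_parallel(&pvals, 3), argmax_serial(&pvals));
    println!("{:?}", argmax_parallel(&pvals, 3));
}
```

Two things to watch if this pattern is applied to the real loops: the parallel result has to reproduce the serial tie-breaking exactly, or the budding order (and hence the output) may change; and spawning threads every round has a cost, so a persistent pool is likely needed for the short per-round work seen on MiSeq.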

## Priority

Lower than the memory tickets: pooled is the least-recommended mode (pseudo/nopool
are preferred and already utilize cores well). File for visibility; chase if the
profile shows a cheap, high-leverage serial phase.

Noted in docs/results.md (MiSeq per-step section).
