perf: optimize the scalar NW path (band-only storage; unify homo/plain kernels)

## Summary

The scalar (non-SIMD) Needleman–Wunsch path in `src/nwalign.rs` is untouched perf territory and is now easier to hit than we'd assumed: **any `homo_gap_p != gap_p` setting forces it**, including a stray `--homo-gap-p -1` on standard PacBio (see #20 / recent homo-gap work). It mirrors R disabling `VECTORIZED_ALIGNMENT` when `HOMOPOLYMER_GAP_PENALTY != GAP_PENALTY` (`dada.R:229-230`), so it's correct — just slow, and slower than it needs to be.

## Where the four kernels dispatch (`raw_align_with_buf`, `nwalign.rs:1155-1199`)

| Trigger | Function | Tables | Band storage |
|---|---|---|---|
| `band==0` or gapless shortcut | `align_gapless_with_buf` | — | — |
| `vectorized && len<3500 && homo==gap` | `align_vectorized_with_buf` | **i16 SIMD** | diagonal, O(N·band) |
| **`homo_gap_p != gap_p`** | `align_endsfree_homo_with_buf` | i32 scalar | **full O(N²) matrix** |
| else (incl. **len ≥ 3500**, or `vectorized=false`) | `align_endsfree_with_buf` | i32 scalar | **full O(N²) matrix** |

So the homopolymer trigger and the ≥3500 long-read trigger land on **two different functions** (`align_endsfree_homo_with_buf` vs `align_endsfree_with_buf`), but they're the **same performance class**: scalar i32 banded DP, no SIMD. If a read is both long and homo-gapping, `use_homo` wins (and it's i32, so no overflow cap is needed).

The two functions are near line-for-line duplicates; the only difference is the per-cell gap penalty:

```rust
// align_endsfree_with_buf            align_endsfree_homo_with_buf
let left = ... + gap_p;               let left = ... if homo2[j-1] {homo_gap_p} else {gap_p};
let up   = ... + gap_p;               let up   = ... if homo1[i-1] {homo_gap_p} else {gap_p};
```
plus the homo version precomputes `homopolymer_mask_into` for both sequences. Banding, traceback, and the matrix layout are identical.

## The main inefficiency

Both scalar kernels **allocate and zero the full `nrow*ncol` matrix** (`reset_buf(&mut buf.d32, nrow*ncol, 0)` at `nwalign.rs:270` and `:376`), then `fill_band_sentinels` over it — even though only a band of width ~`band` is ever computed. That's **O(N²) memory + O(N²) zeroing for an O(N·band) computation.** For a ~1500 bp 16S read that's ~2.25M i32 cells zeroed per alignment; at 3500 bp, ~12M. The vectorized kernel stores only the diagonal band (O(N·band)).

So when homopolymer gapping flips PacBio onto the scalar path, we lose SIMD **and** pay quadratic zeroing/cache traffic the band doesn't need.

## Follow-up directions (rough order of payoff)

1. **Band-only storage for the scalar DP** — store O(N·band) like the vectorized path instead of the full matrix. Biggest win; helps both scalar kernels and especially long reads.
2. **Unify the two functions** — one banded DP taking an optional homopolymer mask (plain = all-false mask), eliminating the duplication and drift risk.
3. **Beyond-R: a homopolymer-aware vectorized kernel** — R has none (it just disables SIMD), so this is net-new. Higher effort; only worth it if homo-gapping becomes a common path (it shouldn't for HiFi).

## Notes / related

- A `--verbose` note now fires when this path engages (`[align] note: homopolymer gapping active ...`, commit a20fee4), so it's visible in per-step benchmark logs.
- Related to the #20 item about making bimera alignment homopolymer-aware — same kernel family (`BimeraAlignParams` has no `homo_gap_p`).
- No correctness concern here — this is purely a performance/cleanup follow-up.


Trigger	Function	Tables	Band storage
`band==0` or gapless shortcut	`align_gapless_with_buf`	—	—
`vectorized && len<3500 && homo==gap`	`align_vectorized_with_buf`	i16 SIMD	diagonal, O(N·band)
`homo_gap_p != gap_p`	`align_endsfree_homo_with_buf`	i32 scalar	full O(N²) matrix
else (incl. len ≥ 3500, or `vectorized=false`)	`align_endsfree_with_buf`	i32 scalar	full O(N²) matrix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: optimize the scalar NW path (band-only storage; unify homo/plain kernels) #21

Summary

Where the four kernels dispatch (`raw_align_with_buf`, `nwalign.rs:1155-1199`)

The main inefficiency

Follow-up directions (rough order of payoff)

Notes / related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

perf: optimize the scalar NW path (band-only storage; unify homo/plain kernels) #21

Description

Summary

Where the four kernels dispatch (raw_align_with_buf, nwalign.rs:1155-1199)

The main inefficiency

Follow-up directions (rough order of payoff)

Notes / related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Where the four kernels dispatch (`raw_align_with_buf`, `nwalign.rs:1155-1199`)