Skip to content

perf: optimize the scalar NW path (band-only storage; unify homo/plain kernels) #21

@cjfields

Description

@cjfields

Summary

The scalar (non-SIMD) Needleman–Wunsch path in src/nwalign.rs is untouched perf territory and is now easier to hit than we'd assumed: any homo_gap_p != gap_p setting forces it, including a stray --homo-gap-p -1 on standard PacBio (see #20 / recent homo-gap work). It mirrors R disabling VECTORIZED_ALIGNMENT when HOMOPOLYMER_GAP_PENALTY != GAP_PENALTY (dada.R:229-230), so it's correct — just slow, and slower than it needs to be.

Where the four kernels dispatch (raw_align_with_buf, nwalign.rs:1155-1199)

Trigger Function Tables Band storage
band==0 or gapless shortcut align_gapless_with_buf
vectorized && len<3500 && homo==gap align_vectorized_with_buf i16 SIMD diagonal, O(N·band)
homo_gap_p != gap_p align_endsfree_homo_with_buf i32 scalar full O(N²) matrix
else (incl. len ≥ 3500, or vectorized=false) align_endsfree_with_buf i32 scalar full O(N²) matrix

So the homopolymer trigger and the ≥3500 long-read trigger land on two different functions (align_endsfree_homo_with_buf vs align_endsfree_with_buf), but they're the same performance class: scalar i32 banded DP, no SIMD. If a read is both long and homo-gapping, use_homo wins (and it's i32, so no overflow cap is needed).

The two functions are near line-for-line duplicates; the only difference is the per-cell gap penalty:

// align_endsfree_with_buf            align_endsfree_homo_with_buf
let left = ... + gap_p;               let left = ... if homo2[j-1] {homo_gap_p} else {gap_p};
let up   = ... + gap_p;               let up   = ... if homo1[i-1] {homo_gap_p} else {gap_p};

plus the homo version precomputes homopolymer_mask_into for both sequences. Banding, traceback, and the matrix layout are identical.

The main inefficiency

Both scalar kernels allocate and zero the full nrow*ncol matrix (reset_buf(&mut buf.d32, nrow*ncol, 0) at nwalign.rs:270 and :376), then fill_band_sentinels over it — even though only a band of width ~band is ever computed. That's O(N²) memory + O(N²) zeroing for an O(N·band) computation. For a ~1500 bp 16S read that's ~2.25M i32 cells zeroed per alignment; at 3500 bp, ~12M. The vectorized kernel stores only the diagonal band (O(N·band)).

So when homopolymer gapping flips PacBio onto the scalar path, we lose SIMD and pay quadratic zeroing/cache traffic the band doesn't need.

Follow-up directions (rough order of payoff)

  1. Band-only storage for the scalar DP — store O(N·band) like the vectorized path instead of the full matrix. Biggest win; helps both scalar kernels and especially long reads.
  2. Unify the two functions — one banded DP taking an optional homopolymer mask (plain = all-false mask), eliminating the duplication and drift risk.
  3. Beyond-R: a homopolymer-aware vectorized kernel — R has none (it just disables SIMD), so this is net-new. Higher effort; only worth it if homo-gapping becomes a common path (it shouldn't for HiFi).

Notes / related

  • A --verbose note now fires when this path engages ([align] note: homopolymer gapping active ..., commit a20fee4), so it's visible in per-step benchmark logs.
  • Related to the Low-priority follow-ups: for_each_sample_concurrent reuse + benchmark wrap-up #20 item about making bimera alignment homopolymer-aware — same kernel family (BimeraAlignParams has no homo_gap_p).
  • No correctness concern here — this is purely a performance/cleanup follow-up.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions