You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The scalar (non-SIMD) Needleman–Wunsch path in src/nwalign.rs is untouched perf territory and is now easier to hit than we'd assumed: any homo_gap_p != gap_p setting forces it, including a stray --homo-gap-p -1 on standard PacBio (see #20 / recent homo-gap work). It mirrors R disabling VECTORIZED_ALIGNMENT when HOMOPOLYMER_GAP_PENALTY != GAP_PENALTY (dada.R:229-230), so it's correct — just slow, and slower than it needs to be.
Where the four kernels dispatch (raw_align_with_buf, nwalign.rs:1155-1199)
Trigger
Function
Tables
Band storage
band==0 or gapless shortcut
align_gapless_with_buf
—
—
vectorized && len<3500 && homo==gap
align_vectorized_with_buf
i16 SIMD
diagonal, O(N·band)
homo_gap_p != gap_p
align_endsfree_homo_with_buf
i32 scalar
full O(N²) matrix
else (incl. len ≥ 3500, or vectorized=false)
align_endsfree_with_buf
i32 scalar
full O(N²) matrix
So the homopolymer trigger and the ≥3500 long-read trigger land on two different functions (align_endsfree_homo_with_buf vs align_endsfree_with_buf), but they're the same performance class: scalar i32 banded DP, no SIMD. If a read is both long and homo-gapping, use_homo wins (and it's i32, so no overflow cap is needed).
The two functions are near line-for-line duplicates; the only difference is the per-cell gap penalty:
// align_endsfree_with_buf align_endsfree_homo_with_buflet left = ... + gap_p;let left = ... if homo2[j-1]{homo_gap_p}else{gap_p};let up = ... + gap_p;let up = ... if homo1[i-1]{homo_gap_p}else{gap_p};
plus the homo version precomputes homopolymer_mask_into for both sequences. Banding, traceback, and the matrix layout are identical.
The main inefficiency
Both scalar kernels allocate and zero the full nrow*ncol matrix (reset_buf(&mut buf.d32, nrow*ncol, 0) at nwalign.rs:270 and :376), then fill_band_sentinels over it — even though only a band of width ~band is ever computed. That's O(N²) memory + O(N²) zeroing for an O(N·band) computation. For a ~1500 bp 16S read that's ~2.25M i32 cells zeroed per alignment; at 3500 bp, ~12M. The vectorized kernel stores only the diagonal band (O(N·band)).
So when homopolymer gapping flips PacBio onto the scalar path, we lose SIMD and pay quadratic zeroing/cache traffic the band doesn't need.
Follow-up directions (rough order of payoff)
Band-only storage for the scalar DP — store O(N·band) like the vectorized path instead of the full matrix. Biggest win; helps both scalar kernels and especially long reads.
Unify the two functions — one banded DP taking an optional homopolymer mask (plain = all-false mask), eliminating the duplication and drift risk.
Beyond-R: a homopolymer-aware vectorized kernel — R has none (it just disables SIMD), so this is net-new. Higher effort; only worth it if homo-gapping becomes a common path (it shouldn't for HiFi).
Notes / related
A --verbose note now fires when this path engages ([align] note: homopolymer gapping active ..., commit a20fee4), so it's visible in per-step benchmark logs.
Summary
The scalar (non-SIMD) Needleman–Wunsch path in
src/nwalign.rsis untouched perf territory and is now easier to hit than we'd assumed: anyhomo_gap_p != gap_psetting forces it, including a stray--homo-gap-p -1on standard PacBio (see #20 / recent homo-gap work). It mirrors R disablingVECTORIZED_ALIGNMENTwhenHOMOPOLYMER_GAP_PENALTY != GAP_PENALTY(dada.R:229-230), so it's correct — just slow, and slower than it needs to be.Where the four kernels dispatch (
raw_align_with_buf,nwalign.rs:1155-1199)band==0or gapless shortcutalign_gapless_with_bufvectorized && len<3500 && homo==gapalign_vectorized_with_bufhomo_gap_p != gap_palign_endsfree_homo_with_bufvectorized=false)align_endsfree_with_bufSo the homopolymer trigger and the ≥3500 long-read trigger land on two different functions (
align_endsfree_homo_with_bufvsalign_endsfree_with_buf), but they're the same performance class: scalar i32 banded DP, no SIMD. If a read is both long and homo-gapping,use_homowins (and it's i32, so no overflow cap is needed).The two functions are near line-for-line duplicates; the only difference is the per-cell gap penalty:
plus the homo version precomputes
homopolymer_mask_intofor both sequences. Banding, traceback, and the matrix layout are identical.The main inefficiency
Both scalar kernels allocate and zero the full
nrow*ncolmatrix (reset_buf(&mut buf.d32, nrow*ncol, 0)atnwalign.rs:270and:376), thenfill_band_sentinelsover it — even though only a band of width ~bandis ever computed. That's O(N²) memory + O(N²) zeroing for an O(N·band) computation. For a ~1500 bp 16S read that's ~2.25M i32 cells zeroed per alignment; at 3500 bp, ~12M. The vectorized kernel stores only the diagonal band (O(N·band)).So when homopolymer gapping flips PacBio onto the scalar path, we lose SIMD and pay quadratic zeroing/cache traffic the band doesn't need.
Follow-up directions (rough order of payoff)
Notes / related
--verbosenote now fires when this path engages ([align] note: homopolymer gapping active ..., commit a20fee4), so it's visible in per-step benchmark logs.BimeraAlignParamshas nohomo_gap_p).