Background
#23 (PR #24) made the per-position quality field compact (f64 mean → u32 sum), cutting resident RSS ~34% in cached pseudo and ~34% raw / ~10% gz on disk (see #1). With quals shrunk, the sequence is now the next-largest field, both:
- resident, where RawInput.seq: String and Raw.seq: Vec<u8> store 1 byte per base, and
- on disk, where the sequence strings dominate the compact derep/sample JSON (PacBio ~1500 bp ACGT per unique).
Idea
Explore 2-bit sequence encoding (A/C/G/T = 2 bits, with an N/ambiguity escape) for the stored sequence.
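As a rough illustration of the idea, here is a minimal sketch of a 2-bit packed sequence with a sparse escape list for non-ACGT bytes. The type name PackedSeq, its fields, and the bit layout (four bases per byte, lowest bits first) are all assumptions made for this example and are not taken from the codebase.

```rust
/// Hypothetical packed sequence: 2 bits per base plus a sparse list of
/// positions whose original byte was not A/C/G/T (for example N).
/// All names here are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct PackedSeq {
    len: usize,
    bits: Vec<u8>,              // 4 bases per byte, lowest bits first
    exceptions: Vec<(u32, u8)>, // (position, original byte) for non-ACGT
}

impl PackedSeq {
    /// Pack an ASCII sequence. Non-ACGT bytes get a placeholder code of 0
    /// and are recorded in `exceptions` so they survive a round trip.
    pub fn pack(seq: &[u8]) -> Self {
        let mut bits = vec![0u8; (seq.len() + 3) / 4];
        let mut exceptions = Vec::new();
        for (i, &b) in seq.iter().enumerate() {
            let code = match b {
                b'A' => 0u8,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                other => {
                    exceptions.push((i as u32, other));
                    0
                }
            };
            bits[i / 4] |= code << ((i % 4) * 2);
        }
        PackedSeq { len: seq.len(), bits, exceptions }
    }

    /// Rebuild the original ASCII bytes, restoring the escaped positions.
    pub fn unpack(&self) -> Vec<u8> {
        const BASES: [u8; 4] = *b"ACGT";
        let mut out = Vec::with_capacity(self.len);
        for i in 0..self.len {
            let code = (self.bits[i / 4] >> ((i % 4) * 2)) & 0b11;
            out.push(BASES[code as usize]);
        }
        for &(pos, byte) in &self.exceptions {
            out[pos as usize] = byte;
        }
        out
    }

    /// Bytes used by the packed base payload (excluding the escape list).
    pub fn packed_bytes(&self) -> usize {
        self.bits.len()
    }
}
```

Under this layout a 1500-base ACGT-only sequence takes 375 payload bytes instead of 1500. Lowercase or IUPAC bytes fall into the escape list, so the sketch only pays off when such bytes are rare.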
Why this is a bigger lift than the quals change
The quals change was a drop-in representation swap because nothing in the hot loop consumed the f64 directly (the kernel rounds to u8 anyway). Sequences are different:
- nt_encode produces 1-byte codes (A=1, C=2, G=3, T=4, N=5); the NW/alignment and k-mer paths operate on these one-byte-per-base arrays, not a 2-bit packed form.
- So a packed seq would need either (a) unpacking to the 1-byte form at the point of use (alignment, k-mer build), or (b) reworking those kernels to operate on packed data directly.
That means a real perf trade-off to measure (decode cost vs memory saved), unlike #23 which was free.
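Option (a) could look roughly like the sketch below: decode into a caller-owned scratch buffer immediately before the alignment or k-mer call, so the decode cost is a linear pass with no per-call allocation. The function names pack_2bit and decode_into, the n_positions parameter, and the bit layout are hypothetical. Only the code values (A=1, C=2, G=3, T=4, N=5) come from the description of nt_encode above.

```rust
/// Hypothetical helper: pack ASCII bases at 2 bits each, lowest bits first.
/// Anything that is not C/G/T is stored as code 0 (same as A); real N
/// positions are assumed to be tracked separately by the caller.
pub fn pack_2bit(seq: &[u8]) -> Vec<u8> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    for (i, &b) in seq.iter().enumerate() {
        let code = match b {
            b'C' => 1u8,
            b'G' => 2,
            b'T' => 3,
            _ => 0,
        };
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    packed
}

/// Decode `len` bases from `packed` into one-byte codes
/// (A=1, C=2, G=3, T=4), then overwrite the positions listed in
/// `n_positions` with 5 (N). `scratch` is cleared and reused so repeated
/// calls do not allocate once it has grown to the longest sequence.
pub fn decode_into(packed: &[u8], len: usize, n_positions: &[usize], scratch: &mut Vec<u8>) {
    scratch.clear();
    scratch.reserve(len);
    for i in 0..len {
        let two_bit = (packed[i / 4] >> ((i % 4) * 2)) & 0b11;
        scratch.push(two_bit + 1);
    }
    for &pos in n_positions {
        scratch[pos] = 5;
    }
}
```

The kernels would then read scratch exactly as they read the one-byte arrays today, which keeps option (a) a local change. Whether the extra pass is cheap enough relative to NW is exactly what needs measuring.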
Notes / scoping
- Resident is the real target. gzip already compresses ACGT to ~2 bits/base, so the on-disk gzipped win from 2-bit may be small (mirrors how the quals gzip win was modest). RAM is where 2-bit helps unconditionally — gzip doesn't help the resident working set.
- Consider an SoA / packed Vec<u8> per sample rather than a String per RawInput to also cut allocation overhead.
- N/ambiguity handling: escape table or a parallel sparse map of non-ACGT positions.
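The per-sample layout and the sparse non-ACGT map from the notes above could be combined roughly as follows. This is a sketch under assumptions: SampleSeqs, its fields, and the choice to start every sequence on a byte boundary are invented for illustration and do not reflect existing types.

```rust
/// Hypothetical per-sample store: one shared 2-bit buffer for every
/// sequence, plus parallel offset/length arrays and a sparse list of
/// non-ACGT bytes. Each sequence starts on a byte boundary.
#[derive(Default)]
pub struct SampleSeqs {
    bits: Vec<u8>,                   // packed bases, 4 per byte, lowest bits first
    starts: Vec<u32>,                // byte offset of each sequence in `bits`
    lens: Vec<u32>,                  // length of each sequence in bases
    exceptions: Vec<(u32, u32, u8)>, // (sequence index, position, original byte)
}

impl SampleSeqs {
    /// Append a sequence and return its index within the sample.
    pub fn push(&mut self, seq: &[u8]) -> usize {
        let idx = self.lens.len();
        let start = self.bits.len();
        self.starts.push(start as u32);
        self.lens.push(seq.len() as u32);
        self.bits.resize(start + (seq.len() + 3) / 4, 0);
        for (i, &b) in seq.iter().enumerate() {
            let code = match b {
                b'A' => 0u8,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                other => {
                    self.exceptions.push((idx as u32, i as u32, other));
                    0
                }
            };
            self.bits[start + i / 4] |= code << ((i % 4) * 2);
        }
        idx
    }

    /// Decode sequence `idx` back to ASCII bytes.
    pub fn get(&self, idx: usize) -> Vec<u8> {
        let start = self.starts[idx] as usize;
        let len = self.lens[idx] as usize;
        let mut out: Vec<u8> = (0..len)
            .map(|i| b"ACGT"[((self.bits[start + i / 4] >> ((i % 4) * 2)) & 0b11) as usize])
            .collect();
        // Linear scan for simplicity; the list is ordered by sequence index,
        // so a binary search would also work.
        for &(s, pos, byte) in &self.exceptions {
            if s as usize == idx {
                out[pos as usize] = byte;
            }
        }
        out
    }

    /// Bytes used by the shared packed buffer.
    pub fn packed_bytes(&self) -> usize {
        self.bits.len()
    }
}
```

With this shape a sample makes a handful of allocations regardless of how many uniques it holds, instead of one heap allocation per sequence. Byte-aligning each sequence wastes at most 6 bits per entry but keeps the indexing simple.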
Acceptance
- Byte-identical denoise output on the fixtures (keep dada_from_fastq_matches_dada_from_derep_json green).
- Measure: resident peak RSS delta (cached pseudo, PacBio) and NW throughput / wall delta from any decode overhead — the latter is the gate, since the alignment path is hot.
Related: #23 (quals, in-memory counterpart, done), #1 (I/O format).