perf/io: encode sequences (2-bit) to shrink resident + on-disk footprint (follow-up to #23)

## Background

#23 (PR #24) made the per-position quality field compact (f64 mean → u32 sum), cutting resident RSS ~34% in cached pseudo and ~34% raw / ~10% gz on disk (see #1). With quals shrunk, the **sequence** is now the next-largest field, both:

- **resident** — `RawInput.seq: String` and `Raw.seq: Vec<u8>` store **1 byte per base**, and
- **on disk** — the `sequence` strings dominate the compact derep/sample JSON (PacBio ~1500 bp ACGT per unique).

## Idea

Explore 2-bit sequence encoding (A/C/G/T = 2 bits, with an N/ambiguity escape) for the stored sequence.

## Why this is a bigger lift than the quals change

The quals change was a drop-in representation swap because nothing in the hot loop consumed the f64 directly (the kernel rounds to u8 anyway). Sequences are different:

- `nt_encode` produces **1-byte codes** (A=1, C=2, G=3, T=4, N=5) — the NW/alignment and k-mer paths operate on these one-byte-per-base arrays, **not** a 2-bit packed form.
- So a packed `seq` would need either (a) unpacking to the 1-byte form at the point of use (alignment, k-mer build), or (b) reworking those kernels to operate on packed data directly.

That means a real perf trade-off to measure (decode cost vs memory saved), unlike #23 which was free.

## Notes / scoping

- **Resident is the real target.** gzip already compresses ACGT to ~2 bits/base, so the *on-disk gzipped* win from 2-bit may be small (mirrors how the quals gzip win was modest). RAM is where 2-bit helps unconditionally — gzip doesn't help the resident working set.
- Consider an SoA / packed `Vec<u8>` per sample rather than per-`RawInput` `String` to also cut allocation overhead.
- N/ambiguity handling: escape table or a parallel sparse map of non-ACGT positions.

## Acceptance

- Byte-identical denoise output on the fixtures (keep `dada_from_fastq_matches_dada_from_derep_json` green).
- Measure: resident peak RSS delta (cached pseudo, PacBio) **and** NW throughput / wall delta from any decode overhead — the latter is the gate, since the alignment path is hot.

Related: #23 (quals, in-memory counterpart, done), #1 (I/O format).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf/io: encode sequences (2-bit) to shrink resident + on-disk footprint (follow-up to #23) #28

Background

Idea

Why this is a bigger lift than the quals change

Notes / scoping

Acceptance

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

perf/io: encode sequences (2-bit) to shrink resident + on-disk footprint (follow-up to #23) #28

Description

Background

Idea

Why this is a bigger lift than the quals change

Notes / scoping

Acceptance

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions