Skip to content

perf/io: encode sequences (2-bit) to shrink resident + on-disk footprint (follow-up to #23) #28

@cjfields

Description

@cjfields

Background

#23 (PR #24) made the per-position quality field compact (f64 mean → u32 sum), cutting resident RSS ~34% in cached pseudo and ~34% raw / ~10% gz on disk (see #1). With quals shrunk, the sequence is now the next-largest field, both:

  • residentRawInput.seq: String and Raw.seq: Vec<u8> store 1 byte per base, and
  • on disk — the sequence strings dominate the compact derep/sample JSON (PacBio ~1500 bp ACGT per unique).

Idea

Explore 2-bit sequence encoding (A/C/G/T = 2 bits, with an N/ambiguity escape) for the stored sequence.

Why this is a bigger lift than the quals change

The quals change was a drop-in representation swap because nothing in the hot loop consumed the f64 directly (the kernel rounds to u8 anyway). Sequences are different:

  • nt_encode produces 1-byte codes (A=1, C=2, G=3, T=4, N=5) — the NW/alignment and k-mer paths operate on these one-byte-per-base arrays, not a 2-bit packed form.
  • So a packed seq would need either (a) unpacking to the 1-byte form at the point of use (alignment, k-mer build), or (b) reworking those kernels to operate on packed data directly.

That means a real perf trade-off to measure (decode cost vs memory saved), unlike #23 which was free.

Notes / scoping

  • Resident is the real target. gzip already compresses ACGT to ~2 bits/base, so the on-disk gzipped win from 2-bit may be small (mirrors how the quals gzip win was modest). RAM is where 2-bit helps unconditionally — gzip doesn't help the resident working set.
  • Consider an SoA / packed Vec<u8> per sample rather than per-RawInput String to also cut allocation overhead.
  • N/ambiguity handling: escape table or a parallel sparse map of non-ACGT positions.

Acceptance

  • Byte-identical denoise output on the fixtures (keep dada_from_fastq_matches_dada_from_derep_json green).
  • Measure: resident peak RSS delta (cached pseudo, PacBio) and NW throughput / wall delta from any decode overhead — the latter is the gate, since the alignment path is hot.

Related: #23 (quals, in-memory counterpart, done), #1 (I/O format).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions