Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,18 @@ demo-seeder/ Operational tool: generates a signed demo WAL on an
wasm bundle.
```

## Verifying large WALs

`spine-cli verify` streams the WAL one line at a time, so peak memory is
flat regardless of size: on a 709 MB, 1M-record WAL it peaks at 4.4 MB.
For routine re-checks, `--chain-only` (verify the hash chain and root,
skip per-record signatures) runs about 9x faster, and
`--sample-signatures N` spot-checks one record in every N. Full
signature verification stays the default. See
[docs/verifying-at-scale.md](docs/verifying-at-scale.md) for the
measured numbers, the threat-model trade-offs of each policy, and the
sub-linear proof-based model for the largest scenarios.

## What this verifies, and what it does not

This codebase verifies that a given Spine WAL file is internally consistent and matches a pinned
Expand Down
193 changes: 193 additions & 0 deletions docs/verifying-at-scale.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,193 @@
# Verifying large WALs

An auditor may receive months of a bank's write-ahead log on a disk:
billions of records, terabytes of JSONL. This note describes how the
open-source verifier handles that, what it guarantees, and where the
honest limits are. It separates three things:

1. What ships in this repository today (the verifier).
2. A verifier-side optimization that is designed but not built here.
3. Server-side work that is out of scope for this repository (it
belongs to the production server, which is not open source).

The numbers below are measured, not modelled, unless a row is labelled a
projection. The benchmark is a 1,000,000-record lenient-signed WAL
(709.5 MB on disk, ~740 bytes/record, 10 segments) verified by the
release CLI on a single core.

## 1. What ships today

### Streaming verification (constant memory)

`spine-cli verify` streams the WAL one line at a time. Peak memory is
flat regardless of total size: the verifier holds one line buffer plus
the running chain state (a few hashes and counters), never the WAL
itself.

| Run on the 709.5 MB / 1M-record WAL | Wall time | Peak RAM | Throughput |
|-------------------------------------|-----------|----------|------------|
| `verify` (full, every signature) | 35.4 s | 4.4 MB | 28,254 ev/s |
| `verify --chain-only` | 4.0 s | 4.4 MB | 249,875 ev/s |
| `verify --sample-signatures 1000` | 4.1 s | 4.5 MB | 246,609 ev/s |

The headline is the 4.4 MB column: it does not grow with the WAL. A
buffer-the-whole-file verifier would peak above the WAL size (roughly
1.2 GB on this input); streaming removes that ceiling entirely, so the
disk a bank ships is never bounded by the auditor's RAM.

The same state machine drives the buffered byte API used by the browser
playground and the cross-language vectors, so the streaming and buffered
paths cannot diverge.

### Signature policy

Verifying one Ed25519 signature per record dominates the cost: on the
benchmark it is about 93% of full-verify time. Walking the chain and
parsing JSON is roughly an order of magnitude cheaper. Three policies
let an auditor choose coverage:

- **Full (default).** Every signed record's signature is verified. The
only policy that defends against a targeted forger. O(n) in the number
of records.
- **`--chain-only`.** Verifies chain linkage, sequence, timestamps,
hash formats and the chain root; skips per-record signatures. About
9x faster here. It retains tamper-evidence only when paired with an
authenticated `--expected-root` (an external anchor the auditor trusts
out of band); without one it proves internal self-consistency only.
Incompatible with `--trusted-pubkey` and `--keystore`.
- **`--sample-signatures N`.** Verifies one record in every N (those
whose `sequence` is a multiple of N). A routine spot-check for
accidental corruption or a wrong-key rollout. It is NOT a defense
against a targeted forger, who can simply avoid the sampled positions;
a record left unchecked is reported in `signatures_skipped` and the
result carries a warning. A sampled signature that fails still fails
the whole run.

Reducing signature coverage never weakens the chain's own
tamper-evidence: chain, sequence, timestamp, hash-format and root checks
always run in full. A broken hash-chain link fails even under
`--chain-only`.

### What this means for an auditor's scenarios

Single core, constant ~5 MB RAM in every row. The first data row is
measured; the rest are projections that scale the measured throughput
linearly (each record is independent work, so this holds until disk I/O
becomes the floor).

| Records (approx size) | Full verify | `--chain-only` | Peak RAM |
|---------------------------|-------------|----------------|----------|
| 1M (0.7 GB, measured) | 35 s | 4 s | 4.4 MB |
| 180M (~130 GB) | ~1.8 h | ~12 min | ~5 MB |
| 1.8B (~1.3 TB) | ~17.7 h | ~2.0 h | ~5 MB |
| 18B (~13 TB) | ~7.4 days | ~20 h | ~5 MB |

At 13 TB the `--chain-only` compute time (~20 h) and the time to read
13 TB from disk are the same order of magnitude, so that row is
I/O-bound: wall clock is about a day either way, but still at a few MB
of RAM, on a laptop.

The takeaway: streaming removes the memory wall outright, and
`--chain-only` / `--sample-signatures` make the small and mid scenarios
a matter of minutes to a couple of hours. The largest scenario is the
one that wants a different verification model, described in section 3.

### Honest limits

- Full verification is O(n): there is no way to check every one of N
signatures in less than N signature verifications on one core.
- `--sample-signatures` trades completeness for speed and is not
adversarial: document the sample rate in the audit record.
- `--chain-only` is only as strong as the `--expected-root` you pin. An
unauthenticated root proves the log is self-consistent, not that it is
the log the bank committed to.

## 2. Designed, not built here: parallel signature verification

Signature verification of one record is independent of every other
record, so the full-verify path parallelizes across cores for a roughly
linear speed-up (for example ~8x on 8 cores), at bounded memory, by
verifying signatures for a window of records in parallel while the chain
walk stays sequential.

This is a verifier-side change and could live in this repository. It is
deliberately not in this release:

- The headline wins (the memory wall, and the order-of-magnitude speed-up
for routine runs) come from streaming and the signature policy, which
are simpler and provably correct.
- Doing it without forking the canonical lenient state machine means
adding threads to `spine-core`, whose value rests on being
single-source-of-truth, no-panic, and clean to compile to wasm (which
has no threads by default). That is a larger change than fits beside
the above.
- Parallelism speeds up the O(n) regime by a constant factor; it does
not change the asymptotics. For the largest scenario the right answer
is the sub-linear, proof-based model in section 3, not a faster brute
force.

Design sketch for when it is built: read records into a bounded window
(for example 8,192 at a time); verify that window's signatures on a
thread pool calling the existing per-record signature primitive in
`spine-core`; keep the chain link, sequence, timestamp and root checks
sequential; merge signature failures back by sequence so output stays
deterministic. Memory stays O(window). No change to the verdict, only to
how fast signatures are checked.

## 3. Out of scope here: sub-linear verification via proofs

The production server (ingestion, batch sealing, retention) is not in
this repository and is not open source. The verification model that
actually removes the "process every record" cost requires the server to
emit additional structure. This section records the design so the
auditor-facing proof-checking can be specified against it; none of it is
implemented in this repository.

### Tier 2: signed checkpoints

The server periodically signs a checkpoint over `(chain_root, length,
timestamp)`: a Signed Tree Head. An auditor verifies a chain of
checkpoints and their consistency instead of every event. The receipts
already carry a `batch_id`, and the README mentions batch sealing, so
the batch structure likely exists server-side already; what is missing
is exposing it to the auditor.

### Tier 3: Merkle transparency log

This is the model that scales to the largest scenario. It is the design
used by Certificate Transparency (RFC 6962) and Trillian to verify
billions of certificates.

The server maintains an append-only Merkle tree over the events and
publishes a signed tree head. With it, an auditor can verify:

- **Inclusion.** That specific events (the ones in the audit scope) are
in the committed log, each via an inclusion proof of about log2(N)
hashes. For 18 billion events that is about 35 hashes per event, not
18 billion.
- **Consistency.** That the current tree is an append-only extension of
an earlier tree head the auditor already trusts, via a consistency
proof, so nothing was edited or removed behind the scenes.

The auditor downloads the signed tree head, the events in scope, and
their proofs, and verifies in seconds at constant memory on a laptop,
without reading the bulk of the log at all. That is what turns the
largest scenario from days into seconds.

A future open-source addition to this repository would be the
auditor-side proof checker: given a signed tree head, a set of leaves,
and their inclusion and consistency proofs, verify them. That checker is
pure logic with no server dependency and would fit the same no-panic,
cross-language-vectored contract as the rest of `spine-core`. The
proof-emitting side stays in the production server.

## Recommended workflow

- Routine / periodic re-check of a large WAL: `--chain-only` with an
authenticated `--expected-root`, or `--sample-signatures N` to spot a
wrong-key rollout cheaply.
- One-time forensic audit where every signature must be checked: full
`verify` (streaming keeps it to a few MB of RAM); accept the O(n)
time, or parallelize once section 2 is built.
- Targeting specific events out of a very large log: the proof-based
model in section 3, once the server emits proofs.
25 changes: 25 additions & 0 deletions spine-cli/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,27 @@ enum Commands {
#[arg(long)]
strict: bool,

/// Verify chain linkage, sequence, timestamps and root only;
/// skip per-record Ed25519 signature verification. Signature
/// checks dominate the cost on a large WAL, so this is much
/// faster while still streaming at constant memory. It retains
/// tamper-evidence only when paired with an authenticated
/// `--expected-root`; without one it proves internal
/// self-consistency. Lenient profile only; incompatible with
/// `--trusted-pubkey`, `--keystore` and `--sample-signatures`.
#[arg(long)]
chain_only: bool,

/// Spot-check signatures by verifying one record in every N
/// (those whose `sequence` is a multiple of N) instead of all
/// of them. A routine check for accidental corruption or a
/// wrong-key rollout, NOT a defense against a targeted forger
/// who can avoid the sampled positions; a sampled signature
/// that fails still fails the run. Lenient profile only;
/// incompatible with `--chain-only`.
#[arg(long, value_name = "N")]
sample_signatures: Option<u64>,

/// Manifest version echoed into the strict report so a
/// consumer can pin the exact contract it expects. Strict
/// profile only; ignored in lenient mode.
Expand Down Expand Up @@ -234,6 +255,8 @@ fn main() -> ExitCode {
keystore,
trusted_pubkey,
strict,
chain_only,
sample_signatures,
manifest_version,
} => verify::run(
&wal,
Expand All @@ -243,6 +266,8 @@ fn main() -> ExitCode {
keystore.as_deref(),
trusted_pubkey.as_deref(),
strict,
chain_only,
sample_signatures,
manifest_version,
cli.format,
)
Expand Down
62 changes: 53 additions & 9 deletions spine-cli/src/verify.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,11 @@ use std::fs;
use std::path::Path;

use spine_core::{
verify_demo_wal, verify_wal_bytes_with_options, DemoRecordOutcome, DemoReport, DemoStatus,
Keystore, LenientOptions, VerificationResult,
verify_demo_wal, DemoRecordOutcome, DemoReport, DemoStatus, Keystore, LenientOptions,
LenientVerifier, SignaturePolicy, VerificationResult,
};

use crate::wal_io::{read_wal_bytes, WalIoError};
use crate::wal_io::{for_each_wal_line, read_wal_bytes, WalIoError};
use crate::OutputFormat;

#[derive(Debug, thiserror::Error)]
Expand Down Expand Up @@ -44,12 +44,43 @@ pub fn run(
keystore_path: Option<&Path>,
trusted_pubkey: Option<&str>,
strict: bool,
chain_only: bool,
sample_signatures: Option<u64>,
manifest_version: u32,
format: OutputFormat,
) -> Result<bool, VerifyCmdError> {
let bytes = read_wal_bytes(wal_path)?;
// Reduced signature policies are a lenient-profile feature: the strict
// profile verifies every signature of the (capped) demo WAL by
// contract, so a request to skip or sample them there is a usage error.
if strict && (chain_only || sample_signatures.is_some()) {
return Err(VerifyCmdError::Usage(
"--chain-only and --sample-signatures apply to the lenient profile only; \
strict verifies every signature"
.to_string(),
));
}
if chain_only && sample_signatures.is_some() {
return Err(VerifyCmdError::Usage(
"choose either --chain-only or --sample-signatures, not both".to_string(),
));
}
if chain_only && (trusted_pubkey.is_some() || keystore_path.is_some()) {
return Err(VerifyCmdError::Usage(
"--chain-only skips per-record signature and receipt checks; remove \
--trusted-pubkey/--keystore (or drop --chain-only to verify them)"
.to_string(),
));
}
if sample_signatures == Some(0) {
return Err(VerifyCmdError::Usage(
"--sample-signatures N requires N >= 1".to_string(),
));
}

if strict {
// Strict is capped at MAX_RECORDS_DEMO records, so buffering the
// whole WAL is bounded; only the lenient path needs streaming.
let bytes = read_wal_bytes(wal_path)?;
return run_strict(
&bytes,
expected_root,
Expand Down Expand Up @@ -77,18 +108,28 @@ pub fn run(
None => None,
};

let policy = if chain_only {
SignaturePolicy::None
} else if let Some(n) = sample_signatures {
SignaturePolicy::Sample { one_in: n }
} else {
SignaturePolicy::All
};

let opts = LenientOptions {
expected_root,
keystore: keystore.as_ref(),
fail_fast,
trusted_pubkey: trusted_pubkey.as_deref(),
};

// verify_wal_bytes_with_options no longer returns Err: the
// partial report (records up to a fail-fast halt, plus the
// failing error in result.errors) is always emitted. We just
// surface it as-is.
let mut result = verify_wal_bytes_with_options(&bytes, &opts);
// Stream the WAL one line at a time so peak memory stays flat
// regardless of total size: the verifier holds only the running
// chain state (one line buffer plus a few hashes), not the WAL.
// process_line returns true under fail-fast to stop early.
let mut verifier = LenientVerifier::new(&opts, policy);
for_each_wal_line(wal_path, |line| verifier.process_line(line))?;
let mut result = verifier.finish();
maybe_add_profile_hint(&mut result);

emit_report(&result, output_path, format)?;
Expand Down Expand Up @@ -328,6 +369,9 @@ fn print_text_report(result: &VerificationResult) {
}
println!("Events verified: {}", result.events_verified);
println!("Signatures verified: {}", result.signatures_verified);
if result.signatures_skipped > 0 {
println!("Signatures skipped: {}", result.signatures_skipped);
}
println!("Receipts verified: {}", result.receipts_verified);
if let (Some(first), Some(last)) = (result.first_sequence, result.last_sequence) {
println!("Sequence range: {first}..={last}");
Expand Down
Loading