Port datastax/jvector#659: Streaming N:1 on-disk graph index compaction by eolivelli · Pull Request #6 · eolivelli/jvector

eolivelli · 2026-05-09T07:08:18Z

Summary

Ports datastax/jvector#659 (Streaming N:1 compaction) into this fork. All 13 upstream commits were cherry-picked onto our main (which carries the four local performance commits) without conflicts.

Adds OnDiskGraphIndexCompactor, a streaming N:1 compaction algorithm for merging multiple on-disk HNSW graph indexes into a single compacted index, plus PQ codebook retraining (PQRetrainer), CompactorBenchmark (JMH), and supporting reporting/storage utilities.

source[0].index  ─┐
source[1].index  ─┤──► OnDiskGraphIndexCompactor ──► compacted.index
source[N].index  ─┘

See docs/compaction.md and benchmarks-jmh/src/main/java/io/github/jbellis/jvector/bench/CompactorBenchmark.md for the full algorithm description and benchmarking instructions.

Upstream commits (cherry-picked, in order)

7c6ccd99 Add on-disk graph index compaction algorithm
52e72171 Add compaction unit tests
475ee063 Add reporting and storage infrastructure for CompactorBenchmark
ce40c754 Add CompactorBenchmark and tooling
c75256af Update build config and project metadata for compaction
415f907b Fix JMH jar selection in run-compaction.yml
224a709a Fix CompactorBenchmark invocation in run-compaction.yml
191a40d2 Address PR review feedback (extracts CompactWriter, rewrites SystemStatsCollector in pure Java)
06fff177 Fix benchmark invocation in docs and default dataset
6178afa1 Fix jar selection: use fixed output name compactor-benchmark.jar
0ab1deaf Refactor workload modes and fix build-from-scratch timing
3127043f Add TIERED_10_90 and TIERED_1_99 split distributions
632bc76d fix for bug when fused pq is used with no hierarchy (datastax/jvector#664)

Verification

mvn -DskipTests -pl jvector-base,jvector-tests,jvector-examples,benchmarks-jmh -am compile → SUCCESS
mvn -pl jvector-tests -am -Dtest=TestOnDiskGraphIndexCompactor -Dsurefire.failIfNoSpecifiedTests=false test → 7 tests, 0 failures

Test plan

CI runs the full test suite on this branch
Optional: run CompactorBenchmark end-to-end on a representative dataset
Confirm interaction with the fork's zero-copy BufferVectorFloat, segmented DenseIntMap, and striped SparseIntMap paths under load

Introduce OnDiskGraphIndexCompactor and PQRetrainer for streaming N:1 merging of on-disk HNSW indexes without full in-memory materialization. Supports deletion filtering via live-node bitsets, custom ordinal mapping, and PQ codebook retraining.
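The deletion filtering and ordinal mapping described above can be pictured with a small, self-contained sketch. It is not the OnDiskGraphIndexCompactor API: the class name, the method buildRemap, and the DELETED marker are hypothetical, and the sketch assumes the simplest possible mapping, in which live nodes are numbered consecutively, source by source.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Illustrative only: hypothetical names, not jvector's compactor API. */
final class OrdinalRemapSketch {
    /** Marker for deleted nodes, which get no ordinal in the compacted index. */
    static final int DELETED = -1;

    /**
     * For each source, maps old ordinals to ordinals in the merged index.
     * Live nodes are numbered consecutively, source by source (an assumed
     * default mapping); nodes whose bit is clear map to DELETED.
     */
    static int[][] buildRemap(int[] sourceSizes, List<BitSet> liveNodes) {
        if (sourceSizes.length != liveNodes.size()) {
            throw new IllegalArgumentException("one live-node bitset per source is required");
        }
        int[][] remap = new int[sourceSizes.length][];
        int next = 0; // next free ordinal in the compacted index
        for (int s = 0; s < sourceSizes.length; s++) {
            remap[s] = new int[sourceSizes[s]];
            BitSet live = liveNodes.get(s);
            for (int old = 0; old < sourceSizes[s]; old++) {
                remap[s][old] = live.get(old) ? next++ : DELETED;
            }
        }
        return remap;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(0, 3);   // source 0 has nodes 0..2
        a.clear(1);    // node 1 was deleted
        BitSet b = new BitSet();
        b.set(0, 2);   // source 1 has nodes 0..1, both live
        int[][] remap = buildRemap(new int[] {3, 2}, List.of(a, b));
        System.out.println(Arrays.deepToString(remap)); // prints [[0, -1, 1], [2, 3]]
    }
}
```

A custom ordinal mapping would replace the `next++` assignment with a caller-supplied function; the sketch only shows how a live-node bitset removes deleted nodes from the output numbering.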

Tests for OnDiskGraphIndexCompactor covering basic compaction, deletions, ordinal remapping, multi-source merging, and FusedPQ compaction scenarios.

Add JFR recording, system stats collection, JSONL logging, git info capture, thread allocation tracking, dataset partitioning, and cloud storage layout utilities used by CompactorBenchmark. Switch jvector-examples logging from logback to log4j2 for consistency with benchmarks-jmh and to avoid duplicate SLF4J bindings in the fat jar.

JMH-based benchmark with configurable workload modes (PARTITION_AND_COMPACT, PARTITION_ONLY, COMPACT_ONLY, BUILD_FROM_SCRATCH), recall measurement, JFR recording, and JSONL result logging. Includes BenchmarkParamCounter for progress tracking, EventLogAnalyzer for post-run analysis, GHA workflow, and exec-maven-plugin integration. Add forced vectorization provider property to VectorizationProvider for benchmark reproducibility.

Add result file patterns to .gitignore, update rat-excludes for the new compaction workflow and catalog cache files.

The benchmarks-jmh-*.jar glob matched the -javadoc jar first, which has no Main-Class. Select the shaded JMH jar explicitly by excluding -javadoc and -sources jars.

Use -cp with CompactorBenchmark.main() instead of -jar with JMH Main to avoid BenchmarkList discovery issues in CI's shaded jar.

- Extract CompactWriter into its own file to reduce OnDiskGraphIndexCompactor size
- Rewrite SystemStatsCollector to read /proc files directly in Java instead of spawning bash
- Clarify recall section description in docs/compaction.md
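As a rough illustration of reading /proc directly instead of spawning bash, the sketch below parses /proc/meminfo with plain java.nio calls. It is not the SystemStatsCollector code: the class name is hypothetical, and the choice of /proc/meminfo is an assumption, since the commit message does not say which /proc files are read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical class, not jvector's SystemStatsCollector. */
final class ProcMeminfoSketch {
    /**
     * Parses lines such as "MemTotal:       16384256 kB" into
     * field name -> numeric value as reported (kB for most fields).
     * Lines without a colon or without a leading number are skipped.
     */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String[] parts = line.substring(colon + 1).trim().split("\\s+");
            if (parts[0].isEmpty()) {
                continue;
            }
            try {
                out.put(line.substring(0, colon).trim(), Long.parseLong(parts[0]));
            } catch (NumberFormatException e) {
                // Not a numeric field; ignore it.
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path meminfo = Path.of("/proc/meminfo");
        if (!Files.isReadable(meminfo)) {
            System.out.println("/proc/meminfo is not available on this platform");
            return;
        }
        Map<String, Long> stats = parse(Files.readAllLines(meminfo));
        System.out.println("MemAvailable (kB): " + stats.get("MemAvailable"));
    }
}
```

Keeping the parser separate from the file read makes it testable on machines without /proc.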

Use -cp instead of -jar in docs since the benchmarks-jmh-*.jar glob matches the -javadoc jar first. Change default dataset from glove-100-angular to ada002-100k. Note -Xmx should be adjusted to fit the dataset.

The benchmarks-jmh-*.jar glob expands to multiple jars (shaded + javadoc), causing -cp to misinterpret the second jar as the main class. Configure shade plugin outputFile to produce a fixed compactor-benchmark.jar name. Update docs and CI workflow.

Simplify WorkloadMode enum: PARTITION_ONLY/COMPACT_ONLY/COMPACT_AND_RECALL/BUILD_FROM_SCRATCH are collapsed into PARTITION/COMPACT/BUILD/PARTITION_AND_COMPACT plus a separate measureRecall flag. Fix buildFromScratch timing to include PQ computation and graph construction (previously only the write step was timed). Add fair comparison guidelines to CompactorBenchmark.md.
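One way to read the enum simplification is that recall measurement becomes a flag that is independent of the workload mode. The sketch below models that reading. Only the mode names and the measureRecall flag come from the commit message; the class, the fromLegacy helper, and the exact old-to-new mapping are assumptions made for illustration.

```java
/** Illustrative only: the legacy mapping is one reading of the commit message. */
final class WorkloadConfigSketch {
    enum WorkloadMode { PARTITION, COMPACT, BUILD, PARTITION_AND_COMPACT }

    final WorkloadMode mode;
    final boolean measureRecall;

    WorkloadConfigSketch(WorkloadMode mode, boolean measureRecall) {
        this.mode = mode;
        this.measureRecall = measureRecall;
    }

    /** Hypothetical helper: translates an old mode name into mode + flag. */
    static WorkloadConfigSketch fromLegacy(String legacyName) {
        switch (legacyName) {
            case "PARTITION_ONLY":
                return new WorkloadConfigSketch(WorkloadMode.PARTITION, false);
            case "COMPACT_ONLY":
                return new WorkloadConfigSketch(WorkloadMode.COMPACT, false);
            case "COMPACT_AND_RECALL":
                // Recall is no longer a mode of its own; it becomes the flag.
                return new WorkloadConfigSketch(WorkloadMode.COMPACT, true);
            case "BUILD_FROM_SCRATCH":
                return new WorkloadConfigSketch(WorkloadMode.BUILD, false);
            case "PARTITION_AND_COMPACT":
                return new WorkloadConfigSketch(WorkloadMode.PARTITION_AND_COMPACT, false);
            default:
                throw new IllegalArgumentException("unknown legacy mode: " + legacyName);
        }
    }

    public static void main(String[] args) {
        WorkloadConfigSketch c = fromLegacy("COMPACT_AND_RECALL");
        System.out.println(c.mode + " measureRecall=" + c.measureRecall);
        // prints COMPACT measureRecall=true
    }
}
```

With the flag separated out, any mode can be combined with recall measurement without adding further enum constants.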

Support 10%/90% and 1%/99% partition splits for benchmarking compaction of a small new segment into a large existing index. Add split distribution reference table to CompactorBenchmark.md.
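To make the tiered splits concrete, the following sketch turns a percentage split into partition sizes for a dataset of a given size. It is not the benchmark's partitioning code: the class and method names are hypothetical, and giving the rounding remainder to the last partition is an assumption.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical helper, not the CompactorBenchmark partitioner. */
final class SplitSketch {
    /**
     * Splits {@code total} vectors into partitions of the given percentages.
     * Every partition except the last is rounded down; the last one takes
     * whatever remains, so the sizes always sum to {@code total}.
     */
    static int[] splitSizes(int total, int... percents) {
        if (percents.length == 0) {
            throw new IllegalArgumentException("at least one partition is required");
        }
        int sum = 0;
        for (int p : percents) {
            sum += p;
        }
        if (sum != 100) {
            throw new IllegalArgumentException("percentages must sum to 100, got " + sum);
        }
        int[] sizes = new int[percents.length];
        int assigned = 0;
        for (int i = 0; i < percents.length - 1; i++) {
            sizes[i] = (int) ((long) total * percents[i] / 100);
            assigned += sizes[i];
        }
        sizes[percents.length - 1] = total - assigned;
        return sizes;
    }

    public static void main(String[] args) {
        // A 10%/90% split, in the spirit of TIERED_10_90.
        System.out.println(Arrays.toString(splitSizes(100_000, 10, 90))); // prints [10000, 90000]
        // A 1%/99% split, in the spirit of TIERED_1_99.
        System.out.println(Arrays.toString(splitSizes(100_000, 1, 99)));  // prints [1000, 99000]
    }
}
```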

dian-lun-lin and others added 13 commits May 9, 2026 09:05

Add compaction unit tests

3ad6d06

Update build config and project metadata for compaction

9e020cc

Fix JMH jar selection in run-compaction.yml

11def8a

Fix CompactorBenchmark invocation in run-compaction.yml

4fcc641

Address PR review feedback

fad066c

Fix benchmark invocation in docs and default dataset

b4b1074

Add TIERED_10_90 and TIERED_1_99 split distributions

16effcd

fix for bug when fused pq is used with no hierarchy (datastax#664)

9142465

eolivelli merged commit 5ba15c1 into main May 9, 2026
4 of 10 checks passed

eolivelli mentioned this pull request May 9, 2026

Adopt jvector OnDiskGraphIndexCompactor for streaming N:1 segment compaction eolivelli/herddb#485

Closed

9 tasks

eolivelli merged 13 commits into main from port/pr-659-streaming-compaction


3 participants
