Fix #375: chunked vector loading uses globally-unique primary keys #387
Open
idevasena wants to merge 1 commit into
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
FileSystemGuy approved these changes on May 26, 2026
Fix #375: chunked vector loading uses globally-unique primary keys
Closes #375.
Problem
In a 1M-vector dry-run on a single Gen5 NVMe, `vdb_benchmark` reported mean recall@10 = 0.0090. The `mlps_1m_1shards_1536dim_uniform_flat_gt` ground-truth collection held only 10,000 vectors, 1% of the source collection, so almost every PK returned by the ANN search was missing from the GT set, and `set_intersection / k` collapsed.
Root cause
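For the metric quoted in the Problem section, the following is a minimal sketch of a recall@k computation of the form `set_intersection / k`. The function name `recall_at_k` and the sample IDs are hypothetical and are not taken from `vdb_benchmark`; the point is only that the score falls toward zero as soon as most retrieved PKs are absent from the ground-truth set.

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Fraction of the top-k retrieved IDs that also appear in the ground truth.

    Hypothetical helper for illustration; not the benchmark's own code.
    """
    hits = set(retrieved_ids[:k]) & set(ground_truth_ids)
    return len(hits) / k


# Made-up IDs: only one of the ten retrieved keys (7) is in the ground truth.
retrieved = [7, 42, 99, 150, 201, 333, 404, 512, 640, 777]
ground_truth = [7, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]

print(recall_at_k(retrieved, ground_truth))  # 0.1
```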
`load_vdb.insert_data()` built each batch's primary keys from `batch_start` and `batch_end`, which were the chunk-local indices. When `num_vectors > chunk_size`, the caller in `main()` invokes `insert_data` once per generated chunk and passes only that chunk's vectors. With `chunk_size = 10_000`, every chunk therefore inserted IDs `0..9_999`, i.e. all 100 chunks collided on the same 10,000 primary keys.

- `num_entities` still reports 1,000,000 because Milvus counts physical rows, not distinct PKs, which masked the bug during loading.
- `enhanced_bench.create_flat_collection()` copies the source via `query_iterator()`, which deduplicates by PK, so the FLAT collection only ever sees the 10,000 unique IDs.
- `enhanced_bench.py` hardcoded the final copy-progress line to `(100.0%)`, hiding the discrepancy in the logs (`Copied 10000/1000000 vectors (100.0%)` in the original report).

Fix
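The collision described under Root cause, and the offset that removes it, can be reproduced without Milvus. This is a minimal sketch and not the code in `load_vdb.py`: `make_ids` and `load_in_chunks` are hypothetical helpers, and only the names `start_id`, `batch_start`, `batch_end`, and `global_id_offset` come from the description. It uses three chunks, as in the bug report's scenario.

```python
def make_ids(batch_start, batch_end, start_id=0):
    """Primary keys for one batch; start_id is the chunk's global offset."""
    return list(range(start_id + batch_start, start_id + batch_end))


def load_in_chunks(num_vectors, chunk_size, batch_size, use_offset):
    """Simulate the chunked loop and return every primary key it would insert."""
    inserted = []
    global_id_offset = 0
    for chunk_begin in range(0, num_vectors, chunk_size):
        chunk_len = min(chunk_size, num_vectors - chunk_begin)
        # Without the offset, every chunk restarts its IDs at 0.
        start_id = global_id_offset if use_offset else 0
        for batch_start in range(0, chunk_len, batch_size):
            batch_end = min(batch_start + batch_size, chunk_len)
            inserted.extend(make_ids(batch_start, batch_end, start_id))
        global_id_offset += chunk_len
    return inserted


buggy = load_in_chunks(30_000, 10_000, 1_000, use_offset=False)
fixed = load_in_chunks(30_000, 10_000, 1_000, use_offset=True)
print(len(buggy), len(set(buggy)))  # 30000 10000
print(len(fixed), len(set(fixed)))  # 30000 30000
```

Advancing the offset by the actual chunk length, not by `chunk_size`, keeps the IDs contiguous when the final chunk is shorter than the others.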
| File | Change |
| --- | --- |
| `vdb_benchmark/vdbbench/load_vdb.py` | `insert_data()` takes a new `start_id` (default `0`, preserves legacy single-chunk behavior). IDs are now `range(start_id + batch_start, start_id + batch_end)`. |
| `vdb_benchmark/vdbbench/load_vdb.py` | `main()` threads a running `global_id_offset` through the chunked-generation loop and passes it as `start_id` on every `insert_data` call. The `else` (single-chunk) branch passes `start_id=0` explicitly for clarity. |
| `vdb_benchmark/vdbbench/enhanced_bench.py` | Replaces the hardcoded `(100.0%)` with the real percentage in `create_flat_collection()`. |
| `vdb_benchmark/vdbbench/enhanced_bench.py` | If the copied FLAT collection falls below a coverage threshold relative to `source_coll.num_entities`, abort with a clear pointer to issue #375 instead of silently producing meaningless recall numbers. |
| `vdb_benchmark/tests/tests/test_issue_375_chunked_insert_ids.py` | Tests covering the `start_id` offset, the three-chunk scenario from the bug report, uneven final chunks, batch sizes larger than the chunk, and the coverage-threshold parametrization. |

Testing
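The two `enhanced_bench.py` changes listed under Fix (a real copy percentage and an abort on low coverage) could be exercised in isolation along the following lines. This is a sketch under assumptions and not the PR's code: `copy_progress_line`, `check_coverage`, and the `min_coverage=0.99` default are hypothetical, and only the log format `Copied N/M vectors (P%)` is taken from the report.

```python
def copy_progress_line(copied, total):
    """Format the progress line with the percentage actually copied."""
    pct = 100.0 * copied / total if total else 0.0
    return f"Copied {copied}/{total} vectors ({pct:.1f}%)"


def check_coverage(copied, source_entities, min_coverage=0.99):
    """Raise if the copy holds too small a share of the source rows.

    min_coverage is a hypothetical default, not a value from the PR.
    """
    coverage = copied / source_entities if source_entities else 0.0
    if coverage < min_coverage:
        raise RuntimeError(
            f"FLAT copy holds {copied} of {source_entities} source rows "
            f"({coverage:.1%}); duplicate primary keys are likely, see issue #375"
        )
    return coverage


# The numbers from the original report now show the real percentage.
print(copy_progress_line(10_000, 1_000_000))  # Copied 10000/1000000 vectors (1.0%)
```

With the figures from the report, `check_coverage(10_000, 1_000_000)` raises, so a truncated ground-truth collection stops the run before any recall number is produced.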