[core] Support snapshot-based sequence ordering for primary-key tables#7832
[core] Support snapshot-based sequence ordering for primary-key tables#7832JunRuiLee wants to merge 8 commits into
Conversation
36b0eaf to
2c737da
Compare
|
Hi @JingsongLi, could you help take a look? Many thanks. |
|
Thanks @leaves12138 for the review! Fixed the compaction ordering issue by persisting per-record snapshotId through _SEQUENCE_NUMBER column. Added tests for the scenario you described. Old constructor removed. PTAL, Thanks! |
… prevent ordering reversal
09ba5c9 to
37cc344
Compare
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the update. I took another careful pass over the snapshot-ordering implementation. I think there are still a few correctness issues to address before this can be safely merged.
| "%s = true is mutually exclusive with %s; the snapshot id is the sole tiebreaker.", | ||
| CoreOptions.SEQUENCE_SNAPSHOT_ORDERING.key(), | ||
| CoreOptions.SEQUENCE_FIELD.key()); | ||
| } |
There was a problem hiding this comment.
This option is currently accepted for every primary-key merge engine, but the implementation only preserves snapshotId for merge functions that return an input KeyValue. For example, PartialUpdateMergeFunction and AggregateMergeFunction build a new KeyValue via replace(...), which resets snapshotId to UNKNOWN_SNAPSHOT_ID. During compaction, stampSequenceWithSnapshotId then writes -1 into _SEQUENCE_NUMBER / file sequence metadata, so later reads can order compacted records incorrectly. Could you either restrict sequence.snapshot-ordering to the supported merge engine(s) here, or propagate the winning snapshot id through all merge functions and add tests for partial-update / aggregation?
| deletionVectorsMaintainer, | ||
| userDefinedSeqComparator); | ||
| userDefinedSeqComparator, | ||
| snapshotSequenceOrdering); |
There was a problem hiding this comment.
The lookup changelog path can still lose the snapshot id when LookupMergeFunction spills its KeyValueBuffer to the binary buffer. KeyValueBuffer.createBinaryBuffer still constructs new KeyValueWithLevelNoReusingSerializer(keyType, valueType) without includeSnapshotId, so after lookup.merge-records-threshold is exceeded, deserialized candidates have UNKNOWN_SNAPSHOT_ID and this comparator falls back to sequence-only ordering. Please thread snapshotSequenceOrdering into KeyValueBuffer's serializer and add a test that forces lookup-buffer spill, for example with a very small lookup.merge-records-threshold and an IOManager.
| .booleanType() | ||
| .defaultValue(false) | ||
| .withDescription( | ||
| "When enabled, merge uses the commit snapshot id as the primary " |
There was a problem hiding this comment.
This option also looks unsafe to enable on a table that already has data written without the feature. Existing APPEND files have minSequenceNumber as the old sequence range, and existing COMPACT files have _SEQUENCE_NUMBER as the old per-record sequence number; after toggling this option on, readers will interpret those values as snapshot ids. Could this be documented and/or rejected for ALTER TABLE as a creation-only option? Otherwise an existing table can silently reorder old records.
There was a problem hiding this comment.
this option is annotated as immutable, so enabling it via ALTER on a table with existing snapshots is rejected; empty-table ALTER remains allowed.
…eBuffer spill PartialUpdateMergeFunction and AggregateMergeFunction reset snapshotId to UNKNOWN_SNAPSHOT_ID via reused.replace(...) in getResult(), causing compaction to stamp -1 into per-record _SEQUENCE_NUMBER and break snapshot-based ordering. Restore the latest input snapshotId on the merged result. KeyValueBuffer.createBinaryBuffer also dropped snapshotId during spill round-trip when snapshot-ordering was enabled; pass options.snapshotSequenceOrdering() to the serializer so spilled candidates survive deserialization. Adds unit tests for getResult() snapshotId across deduplicate / first-row / aggregate / partial-update, plus table-level regression tests covering partial-update and aggregate compaction and the lookup-merge spill path.
|
Thanks @leaves12138 for the careful review. I fixed the first two correctness issues:
I also added regression coverage for merge-function snapshotId preservation, partial-update compaction, aggregate compaction, and lookup buffer spill. For the ALTER TABLE concern: this option is annotated as immutable, so enabling it via ALTER on a table with existing snapshots is rejected; empty-table ALTER remains allowed. |
Purpose
close #7806
Tests