[core] Support snapshot-based sequence ordering for primary-key tables by JunRuiLee · Pull Request #7832 · apache/paimon

JunRuiLee · 2026-05-12T15:06:20Z

Purpose

Tests

SchemaValidationTest#testSnapshotSequenceOrderingHappyPath
SchemaValidationTest#testSnapshotSequenceOrderingRejectsSequenceField
SchemaValidationTest#testSnapshotSequenceOrderingRejectsNonPkTable
KeyValueWithLevelNoReusingSerializerSnapshotIdTest#testRoundTripWithSnapshotId
KeyValueWithLevelNoReusingSerializerSnapshotIdTest#testRoundTripWithoutSnapshotId
SortMergeSnapshotOrderingTest#testLaterSnapshotWinsOverHigherSequence
SortMergeSnapshotOrderingTest#testFallsBackToSequenceWhenSnapshotMissing
SortMergeSnapshotOrderingTest#testSameSnapshotFallsBackToSequence
SortMergeSnapshotOrderingTest#testStampedAlwaysBeatsUnstamped
PrimaryKeySimpleTableTest#testSnapshotSequenceOrdering
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingFallsBackToSequenceWithinSnapshot
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingCompactionPreservesInputSnapshotId
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingWithChangelogInput
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingWithChangelogLookup
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingDeleteFromLaterSnapshot
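The ordering rules implied by the SortMergeSnapshotOrderingTest names above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `VersionedRecord`, `compareVersions`, and `newer` are hypothetical names, and the use of -1 for an unknown snapshot id is taken from the review discussion further down.

```java
/** Hypothetical, simplified stand-in for a record version; not Paimon's KeyValue. */
final class VersionedRecord {
    // Assumption: -1 marks a record whose commit snapshot is unknown ("unstamped").
    static final long UNKNOWN_SNAPSHOT_ID = -1L;

    final long snapshotId;
    final long sequenceNumber;
    final String value;

    VersionedRecord(long snapshotId, long sequenceNumber, String value) {
        this.snapshotId = snapshotId;
        this.sequenceNumber = sequenceNumber;
        this.value = value;
    }

    /** Returns a positive number when {@code a} is newer than {@code b}. */
    static int compareVersions(VersionedRecord a, VersionedRecord b) {
        // Snapshot id is the primary key of the comparison. Because the unknown
        // marker is -1, any stamped record ranks above any unstamped one.
        int bySnapshot = Long.compare(a.snapshotId, b.snapshotId);
        if (bySnapshot != 0) {
            return bySnapshot;
        }
        // Same snapshot, or both unknown: fall back to the sequence number.
        return Long.compare(a.sequenceNumber, b.sequenceNumber);
    }

    /** Picks the record that should survive a merge of two versions of one key. */
    static VersionedRecord newer(VersionedRecord a, VersionedRecord b) {
        return compareVersions(a, b) >= 0 ? a : b;
    }
}
```

Under these assumptions, a record from snapshot 5 with sequence number 1 is chosen over a record from snapshot 4 with sequence number 100, while two records from the same snapshot are ordered by sequence number alone.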

JunRuiLee · 2026-05-13T03:46:53Z

Hi @JingsongLi, could you help take a look? Many thanks.

JunRuiLee · 2026-05-14T07:07:58Z

Thanks @leaves12138 for the review! I fixed the compaction ordering issue by persisting the per-record snapshotId through the _SEQUENCE_NUMBER column, and added tests for the scenario you described. The old constructor has been removed.

PTAL, Thanks!

leaves12138

Thanks for the update. I took another careful pass over the snapshot-ordering implementation. I think there are still a few correctness issues to address before this can be safely merged.

leaves12138 · 2026-05-18T06:02:03Z

+                "%s = true is mutually exclusive with %s; the snapshot id is the sole tiebreaker.",
+                CoreOptions.SEQUENCE_SNAPSHOT_ORDERING.key(),
+                CoreOptions.SEQUENCE_FIELD.key());
+    }


This option is currently accepted for every primary-key merge engine, but the implementation only preserves snapshotId for merge functions that return an input KeyValue. For example, PartialUpdateMergeFunction and AggregateMergeFunction build a new KeyValue via replace(...), which resets snapshotId to UNKNOWN_SNAPSHOT_ID. During compaction, stampSequenceWithSnapshotId then writes -1 into _SEQUENCE_NUMBER / file sequence metadata, so later reads can order compacted records incorrectly. Could you either restrict sequence.snapshot-ordering to the supported merge engine(s) here, or propagate the winning snapshot id through all merge functions and add tests for partial-update / aggregation?
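To make the concern concrete, here is a minimal sketch of a merge function that builds a fresh result instead of returning one of its inputs. It is not Paimon's `AggregateMergeFunction`: `SumMergeSketch`, `Kv`, and both result methods are hypothetical, and treating the highest input snapshot id as the one to carry forward is an assumption made for illustration.

```java
/** Hypothetical, simplified sum aggregation; not Paimon's AggregateMergeFunction. */
final class SumMergeSketch {
    static final long UNKNOWN_SNAPSHOT_ID = -1L;

    /** Simplified stand-in for a key-value record. */
    static final class Kv {
        final long snapshotId;
        final long sequenceNumber;
        final long value;

        Kv(long snapshotId, long sequenceNumber, long value) {
            this.snapshotId = snapshotId;
            this.sequenceNumber = sequenceNumber;
            this.value = value;
        }
    }

    private long sum = 0L;
    private long latestSequence = Long.MIN_VALUE;
    private long latestSnapshotId = UNKNOWN_SNAPSHOT_ID;

    void add(Kv kv) {
        sum += kv.value;
        latestSequence = Math.max(latestSequence, kv.sequenceNumber);
        // Assumption: the snapshot id to keep is the highest one seen among the inputs.
        latestSnapshotId = Math.max(latestSnapshotId, kv.snapshotId);
    }

    /** Problem shape: the freshly built result forgets which snapshot its inputs came from. */
    Kv resultWithoutSnapshot() {
        return new Kv(UNKNOWN_SNAPSHOT_ID, latestSequence, sum);
    }

    /** Possible fix shape: carry the tracked snapshot id onto the merged result. */
    Kv resultWithSnapshot() {
        return new Kv(latestSnapshotId, latestSequence, sum);
    }
}
```

In this sketch, merging inputs from snapshots 3 and 7 yields a result stamped with -1 in the first variant and with 7 in the second, which is the difference a later compaction write would persist.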

leaves12138 · 2026-05-18T06:02:03Z

                    deletionVectorsMaintainer,
-                    userDefinedSeqComparator);
+                    userDefinedSeqComparator,
+                    snapshotSequenceOrdering);


The lookup changelog path can still lose the snapshot id when LookupMergeFunction spills its KeyValueBuffer to the binary buffer. KeyValueBuffer.createBinaryBuffer still constructs new KeyValueWithLevelNoReusingSerializer(keyType, valueType) without includeSnapshotId, so after lookup.merge-records-threshold is exceeded, deserialized candidates have UNKNOWN_SNAPSHOT_ID and this comparator falls back to sequence-only ordering. Please thread snapshotSequenceOrdering into KeyValueBuffer's serializer and add a test that forces lookup-buffer spill, for example with a very small lookup.merge-records-threshold and an IOManager.
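The spill problem described here can be illustrated with a small round-trip sketch. It does not use Paimon's `KeyValueWithLevelNoReusingSerializer`; `SpillSerializerSketch` and its byte layout are hypothetical, and only the idea of an `includeSnapshotId` flag is borrowed from the comment above.

```java
import java.nio.ByteBuffer;

/** Hypothetical serializer showing how a missing flag drops the snapshot id on spill. */
final class SpillSerializerSketch {
    static final long UNKNOWN_SNAPSHOT_ID = -1L;

    private final boolean includeSnapshotId;

    SpillSerializerSketch(boolean includeSnapshotId) {
        this.includeSnapshotId = includeSnapshotId;
    }

    /** Writes the sequence number, plus the snapshot id only when the flag is set. */
    byte[] serialize(long sequenceNumber, long snapshotId) {
        ByteBuffer buf = ByteBuffer.allocate(includeSnapshotId ? 16 : 8);
        buf.putLong(sequenceNumber);
        if (includeSnapshotId) {
            buf.putLong(snapshotId);
        }
        return buf.array();
    }

    /** Returns {sequenceNumber, snapshotId}; the id is unknown if it was never written. */
    long[] deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long sequenceNumber = buf.getLong();
        long snapshotId = includeSnapshotId ? buf.getLong() : UNKNOWN_SNAPSHOT_ID;
        return new long[] {sequenceNumber, snapshotId};
    }
}
```

A serializer constructed with `false` returns -1 for the snapshot id after a round trip, so a comparator downstream would have nothing but the sequence number to order by. Constructing it with `true` preserves the id.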

leaves12138 · 2026-05-18T06:02:03Z

+                    .booleanType()
+                    .defaultValue(false)
+                    .withDescription(
+                            "When enabled, merge uses the commit snapshot id as the primary "


This option also looks unsafe to enable on a table that already has data written without the feature. Existing APPEND files have minSequenceNumber as the old sequence range, and existing COMPACT files have _SEQUENCE_NUMBER as the old per-record sequence number; after toggling this option on, readers will interpret those values as snapshot ids. Could this be documented and/or rejected for ALTER TABLE as a creation-only option? Otherwise an existing table can silently reorder old records.
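A tiny worked example of the reordering risk, under stated assumptions: `ToggleHazardSketch` and `Stored` are hypothetical, and the reader is reduced to "the larger stored ordering value wins". The numbers are made up for illustration.

```java
/** Hypothetical illustration of reinterpreting stored ordering values; not Paimon code. */
final class ToggleHazardSketch {
    /** A record as it sits in a data file, with whatever was written to the sequence column. */
    static final class Stored {
        final long storedOrderingValue;
        final String payload;

        Stored(long storedOrderingValue, String payload) {
            this.storedOrderingValue = storedOrderingValue;
            this.payload = payload;
        }
    }

    /** A reader with snapshot ordering enabled treats every stored value as a snapshot id. */
    static Stored pickWinner(Stored a, Stored b) {
        return a.storedOrderingValue >= b.storedOrderingValue ? a : b;
    }

    /** Old record: sequence number 1000, written before the toggle. New record: snapshot id 3. */
    static String demo() {
        Stored oldRecord = new Stored(1000L, "old");
        Stored newRecord = new Stored(3L, "new");
        return pickWinner(oldRecord, newRecord).payload;
    }
}
```

Here `demo()` returns the old payload, because the pre-existing sequence number 1000 is read as if it were a snapshot id far larger than 3.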

This option is annotated as immutable, so enabling it via ALTER on a table with existing snapshots is rejected; empty-table ALTER remains allowed.

…eBuffer spill

PartialUpdateMergeFunction and AggregateMergeFunction reset snapshotId to UNKNOWN_SNAPSHOT_ID via reused.replace(...) in getResult(), causing compaction to stamp -1 into per-record _SEQUENCE_NUMBER and break snapshot-based ordering. Restore the latest input snapshotId on the merged result.

KeyValueBuffer.createBinaryBuffer also dropped snapshotId during spill round-trip when snapshot-ordering was enabled; pass options.snapshotSequenceOrdering() to the serializer so spilled candidates survive deserialization.

Adds unit tests for getResult() snapshotId across deduplicate / first-row / aggregate / partial-update, plus table-level regression tests covering partial-update and aggregate compaction and the lookup-merge spill path.

JunRuiLee · 2026-05-18T07:13:52Z

Thanks @leaves12138 for the careful review.

I fixed the first two correctness issues:

PartialUpdateMergeFunction and AggregateMergeFunction now preserve the winning input record’s snapshotId when returning a newly built KeyValue, so compaction no longer stamps UNKNOWN_SNAPSHOT_ID into _SEQUENCE_NUMBER.
KeyValueBuffer now preserves snapshotId when snapshot ordering is enabled, so lookup compaction buffer spill does not lose it during binary serialization/deserialization.

I also added regression coverage for merge-function snapshotId preservation, partial-update compaction, aggregate compaction, and lookup buffer spill.

For the ALTER TABLE concern: this option is annotated as immutable, so enabling it via ALTER on a table with existing snapshots is rejected; empty-table ALTER remains allowed.
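The behavior described for ALTER TABLE could look roughly like the check below. This is only a sketch of the stated rule, not how Paimon enforces immutable options: `ImmutableOptionCheckSketch`, `checkAlter`, and the `tableHasSnapshots` parameter are hypothetical, and only the option key `sequence.snapshot-ordering` comes from this PR.

```java
/** Hypothetical check mirroring the stated rule; not Paimon's schema validation code. */
final class ImmutableOptionCheckSketch {
    static final String SNAPSHOT_ORDERING_KEY = "sequence.snapshot-ordering";

    /**
     * Rejects a change to the snapshot-ordering option once the table has snapshots.
     * A change on an empty table, or no change at all, is allowed.
     */
    static void checkAlter(
            String key, String oldValue, String newValue, boolean tableHasSnapshots) {
        boolean changed = oldValue == null ? newValue != null : !oldValue.equals(newValue);
        if (SNAPSHOT_ORDERING_KEY.equals(key) && changed && tableHasSnapshots) {
            throw new UnsupportedOperationException(
                    "Changing '" + key + "' is not supported once the table has snapshots.");
        }
    }
}
```

With this shape, switching the option from `false` to `true` passes on an empty table and throws on a table that already has committed snapshots.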

JunRuiLee force-pushed the snapshot-ordering-v2 branch from 36b0eaf to 2c737da Compare May 13, 2026 02:35

leaves12138 reviewed May 14, 2026

Comment thread paimon-core/src/main/java/org/apache/paimon/io/KeyValueDataFileRecordReader.java Outdated

leaves12138 reviewed May 14, 2026

Comment thread paimon-core/src/main/java/org/apache/paimon/operation/FileStoreCommitImpl.java Outdated

leaves12138 reviewed May 14, 2026

Comment thread paimon-core/src/main/java/org/apache/paimon/operation/FileStoreCommitImpl.java Outdated

JunRuiLee added 7 commits May 14, 2026 15:10

[core] Add sequence.snapshot-ordering option with mutual-exclusion validation (9a26574)

[core] Add snapshotId on KeyValue for snapshot-based merge ordering (d7dea7b)

[core] Use commit snapshot id as primary tiebreaker in sort-merge readers (899b73e)

[core] Stamp KeyValue.snapshotId from file minSequenceNumber on read (bd86168)

[core] Assign snapshot id to minSequenceNumber at commit time (9643abb)

[core] End-to-end tests for snapshot-based sequence ordering (e90b67f)

[hotfix] Fix per-record snapshotId preservation through compaction to prevent ordering reversal (37cc344)

JunRuiLee force-pushed the snapshot-ordering-v2 branch from 09ba5c9 to 37cc344 Compare May 14, 2026 07:35

leaves12138 requested changes May 18, 2026

JunRuiLee wants to merge 8 commits into apache:master from JunRuiLee:snapshot-ordering-v2
