[core] Support snapshot-based sequence ordering for primary-key tables #7832

Open
JunRuiLee wants to merge 8 commits into apache:master from JunRuiLee:snapshot-ordering-v2

@JunRuiLee
Contributor

Purpose

close #7806

Tests

  • SchemaValidationTest#testSnapshotSequenceOrderingHappyPath
  • SchemaValidationTest#testSnapshotSequenceOrderingRejectsSequenceField
  • SchemaValidationTest#testSnapshotSequenceOrderingRejectsNonPkTable
  • KeyValueWithLevelNoReusingSerializerSnapshotIdTest#testRoundTripWithSnapshotId
  • KeyValueWithLevelNoReusingSerializerSnapshotIdTest#testRoundTripWithoutSnapshotId
  • SortMergeSnapshotOrderingTest#testLaterSnapshotWinsOverHigherSequence
  • SortMergeSnapshotOrderingTest#testFallsBackToSequenceWhenSnapshotMissing
  • SortMergeSnapshotOrderingTest#testSameSnapshotFallsBackToSequence
  • SortMergeSnapshotOrderingTest#testStampedAlwaysBeatsUnstamped
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrdering
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingFallsBackToSequenceWithinSnapshot
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingCompactionPreservesInputSnapshotId
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingWithChangelogInput
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingWithChangelogLookup
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingDeleteFromLaterSnapshot
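
The SortMergeSnapshotOrderingTest names above suggest a two-level ordering: snapshot id first, sequence number as the tiebreaker. A minimal sketch of that ordering follows. The class `SnapshotOrderingSketch`, the `Rec` type, and the `winner` helper are hypothetical stand-ins, not Paimon classes, and the sketch assumes the unknown snapshot id is represented as -1, as mentioned later in the review.

```java
import java.util.Comparator;

/** Hypothetical stand-in for the ordering the tests describe; not Paimon's actual classes. */
final class SnapshotOrderingSketch {

    /** Assumed sentinel for "no snapshot id recorded". */
    static final long UNKNOWN_SNAPSHOT_ID = -1L;

    /** Only the two fields that matter for ordering. */
    static final class Rec {
        final long snapshotId;
        final long sequenceNumber;

        Rec(long snapshotId, long sequenceNumber) {
            this.snapshotId = snapshotId;
            this.sequenceNumber = sequenceNumber;
        }
    }

    /** Snapshot id first; sequence number breaks ties within the same snapshot. */
    static final Comparator<Rec> ORDER =
            Comparator.<Rec>comparingLong(r -> r.snapshotId)
                    .thenComparingLong(r -> r.sequenceNumber);

    /** Returns the record that should win the merge, i.e. the later one under ORDER. */
    static Rec winner(Rec a, Rec b) {
        return ORDER.compare(a, b) >= 0 ? a : b;
    }
}
```

Because the assumed sentinel is smaller than any real snapshot id, a stamped record beats an unstamped one without a special case, and two unstamped records fall back to sequence-number order.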

@JunRuiLee JunRuiLee force-pushed the snapshot-ordering-v2 branch from 36b0eaf to 2c737da Compare May 13, 2026 02:35
@JunRuiLee
Contributor Author

Hi @JingsongLi, could you help take a look? Many thanks.

Comment thread paimon-core/src/main/java/org/apache/paimon/io/KeyValueDataFileRecordReader.java Outdated
Comment thread paimon-core/src/main/java/org/apache/paimon/operation/FileStoreCommitImpl.java Outdated
Comment thread paimon-core/src/main/java/org/apache/paimon/operation/FileStoreCommitImpl.java Outdated
@JunRuiLee
Contributor Author

Thanks @leaves12138 for the review! I fixed the compaction ordering issue by persisting the per-record snapshotId through the _SEQUENCE_NUMBER column, and added tests for the scenario you described. The old constructor has been removed.

PTAL, Thanks!

@JunRuiLee JunRuiLee force-pushed the snapshot-ordering-v2 branch from 09ba5c9 to 37cc344 Compare May 14, 2026 07:35
Contributor

@leaves12138 leaves12138 left a comment

Thanks for the update. I took another careful pass over the snapshot-ordering implementation. I think there are still a few correctness issues to address before this can be safely merged.

"%s = true is mutually exclusive with %s; the snapshot id is the sole tiebreaker.",
CoreOptions.SEQUENCE_SNAPSHOT_ORDERING.key(),
CoreOptions.SEQUENCE_FIELD.key());
}
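
For context, the message in the fragment above belongs to a schema-validation check. A rough sketch of the two rejections named in the test list (a configured sequence field, and a table without primary keys) might look like the following. The class and method names are hypothetical, the option keys `sequence.snapshot-ordering` and `sequence.field` are assumed to be what the `CoreOptions` constants resolve to, and the non-primary-key message is invented for illustration.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical validation sketch; not the actual SchemaValidation code. */
final class SnapshotOrderingValidationSketch {

    // Assumed option keys.
    static final String SNAPSHOT_ORDERING = "sequence.snapshot-ordering";
    static final String SEQUENCE_FIELD = "sequence.field";

    static void validate(Map<String, String> options, List<String> primaryKeys) {
        boolean enabled =
                Boolean.parseBoolean(options.getOrDefault(SNAPSHOT_ORDERING, "false"));
        if (!enabled) {
            return; // nothing to check when the feature is off
        }
        if (primaryKeys.isEmpty()) {
            // Illustrative wording only.
            throw new IllegalArgumentException(
                    SNAPSHOT_ORDERING + " requires a primary-key table.");
        }
        if (options.containsKey(SEQUENCE_FIELD)) {
            // Message format taken from the diff fragment under review.
            throw new IllegalArgumentException(
                    String.format(
                            "%s = true is mutually exclusive with %s; "
                                    + "the snapshot id is the sole tiebreaker.",
                            SNAPSHOT_ORDERING, SEQUENCE_FIELD));
        }
    }
}
```
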
Contributor

This option is currently accepted for every primary-key merge engine, but the implementation only preserves snapshotId for merge functions that return an input KeyValue. For example, PartialUpdateMergeFunction and AggregateMergeFunction build a new KeyValue via replace(...), which resets snapshotId to UNKNOWN_SNAPSHOT_ID. During compaction, stampSequenceWithSnapshotId then writes -1 into _SEQUENCE_NUMBER / file sequence metadata, so later reads can order compacted records incorrectly. Could you either restrict sequence.snapshot-ordering to the supported merge engine(s) here, or propagate the winning snapshot id through all merge functions and add tests for partial-update / aggregation?
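
To illustrate the second alternative (propagating the winning snapshot id), here is a toy merge function whose `replace(...)` resets the snapshot id, as described above, and whose `getResult()` restores it. All names (`MergeSnapshotIdSketch`, `Kv`, `ConcatMerge`) are hypothetical, and the sketch assumes inputs arrive in merge order, so the last record added is the winner.

```java
/** Hypothetical sketch of carrying the winning snapshot id through a rebuilt result. */
final class MergeSnapshotIdSketch {

    static final long UNKNOWN_SNAPSHOT_ID = -1L;

    /** Simplified key-value record; only the fields relevant to ordering plus a value. */
    static final class Kv {
        long sequenceNumber;
        long snapshotId;
        String value;

        Kv(long sequenceNumber, long snapshotId, String value) {
            this.sequenceNumber = sequenceNumber;
            this.snapshotId = snapshotId;
            this.value = value;
        }

        /** Mimics the behaviour described in the review: replace() forgets the snapshot id. */
        Kv replace(long sequenceNumber, String value) {
            this.sequenceNumber = sequenceNumber;
            this.value = value;
            this.snapshotId = UNKNOWN_SNAPSHOT_ID;
            return this;
        }
    }

    /** Toy aggregate that concatenates values and remembers the last input's ordering fields. */
    static final class ConcatMerge {
        private final Kv reused = new Kv(0L, UNKNOWN_SNAPSHOT_ID, "");
        private final StringBuilder acc = new StringBuilder();
        private long latestSequence;
        private long latestSnapshotId = UNKNOWN_SNAPSHOT_ID;

        void reset() {
            acc.setLength(0);
            latestSequence = 0L;
            latestSnapshotId = UNKNOWN_SNAPSHOT_ID;
        }

        /** Assumes inputs arrive in merge order, so the last one added is the winner. */
        void add(Kv kv) {
            acc.append(kv.value);
            latestSequence = kv.sequenceNumber;
            latestSnapshotId = kv.snapshotId;
        }

        Kv getResult() {
            Kv result = reused.replace(latestSequence, acc.toString());
            // Without this line the result would carry UNKNOWN_SNAPSHOT_ID.
            result.snapshotId = latestSnapshotId;
            return result;
        }
    }
}
```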

Contributor Author

fixed

deletionVectorsMaintainer,
userDefinedSeqComparator);
userDefinedSeqComparator,
snapshotSequenceOrdering);
Contributor

The lookup changelog path can still lose the snapshot id when LookupMergeFunction spills its KeyValueBuffer to the binary buffer. KeyValueBuffer.createBinaryBuffer still constructs new KeyValueWithLevelNoReusingSerializer(keyType, valueType) without includeSnapshotId, so after lookup.merge-records-threshold is exceeded, deserialized candidates have UNKNOWN_SNAPSHOT_ID and this comparator falls back to sequence-only ordering. Please thread snapshotSequenceOrdering into KeyValueBuffer's serializer and add a test that forces lookup-buffer spill, for example with a very small lookup.merge-records-threshold and an IOManager.
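
The spill problem can be reproduced in miniature with a serializer that writes the snapshot id only when a flag is set. This is a self-contained sketch using `java.io` streams. `SpillSerializerSketch` and `Candidate` are hypothetical names, and the byte layout is invented for illustration; it is not the layout of KeyValueWithLevelNoReusingSerializer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical serializer showing how an omitted field is lost across a spill round trip. */
final class SpillSerializerSketch {

    static final long UNKNOWN_SNAPSHOT_ID = -1L;

    static final class Candidate {
        final long sequenceNumber;
        final int level;
        final long snapshotId;

        Candidate(long sequenceNumber, int level, long snapshotId) {
            this.sequenceNumber = sequenceNumber;
            this.level = level;
            this.snapshotId = snapshotId;
        }
    }

    private final boolean includeSnapshotId;

    SpillSerializerSketch(boolean includeSnapshotId) {
        this.includeSnapshotId = includeSnapshotId;
    }

    byte[] serialize(Candidate c) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(c.sequenceNumber);
            out.writeInt(c.level);
            if (includeSnapshotId) {
                out.writeLong(c.snapshotId); // only written when the flag is threaded through
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Candidate deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            long sequenceNumber = in.readLong();
            int level = in.readInt();
            // Without the flag there is nothing to read, so the id degrades to unknown.
            long snapshotId = includeSnapshotId ? in.readLong() : UNKNOWN_SNAPSHOT_ID;
            return new Candidate(sequenceNumber, level, snapshotId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A serializer built with `includeSnapshotId = false` returns candidates whose snapshot id is unknown, which is the condition under which the comparator would fall back to sequence-only ordering.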

Contributor Author

@JunRuiLee JunRuiLee May 18, 2026

fixed

.booleanType()
.defaultValue(false)
.withDescription(
"When enabled, merge uses the commit snapshot id as the primary "
Contributor

This option also looks unsafe to enable on a table that already has data written without the feature. Existing APPEND files have minSequenceNumber as the old sequence range, and existing COMPACT files have _SEQUENCE_NUMBER as the old per-record sequence number; after toggling this option on, readers will interpret those values as snapshot ids. Could this be documented and/or rejected for ALTER TABLE as a creation-only option? Otherwise an existing table can silently reorder old records.
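
A small worked example of the hazard, under simplifying assumptions: the reader is reduced to comparing one stored long per file, and the numbers are made up. `ToggleHazardSketch` and `FileMeta` are hypothetical names.

```java
/** Hypothetical illustration of reinterpreting old sequence numbers as snapshot ids. */
final class ToggleHazardSketch {

    /** Reader-side view of a data file: a label plus the single stored ordering value. */
    static final class FileMeta {
        final String label;
        final long minSequenceNumber;

        FileMeta(String label, long minSequenceNumber) {
            this.label = label;
            this.minSequenceNumber = minSequenceNumber;
        }
    }

    /**
     * With snapshot ordering enabled, the stored value is read as a snapshot id,
     * regardless of what the writer meant when the file was produced.
     */
    static FileMeta newerUnderSnapshotOrdering(FileMeta a, FileMeta b) {
        return a.minSequenceNumber >= b.minSequenceNumber ? a : b;
    }

    public static void main(String[] args) {
        // Written before the toggle: 1000 is a plain sequence number.
        FileMeta oldFile = new FileMeta("old", 1000L);
        // Written after the toggle: 12 is the committing snapshot id.
        FileMeta newFile = new FileMeta("new", 12L);
        // The older file is picked as "newer", so old records shadow new ones.
        System.out.println(newerUnderSnapshotOrdering(oldFile, newFile).label);
    }
}
```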

Contributor Author

This option is annotated as immutable, so enabling it via ALTER on a table with existing snapshots is rejected; empty-table ALTER remains allowed.
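
The rule stated in this reply can be sketched as a guard on option changes. This is an illustration of the described behaviour only, not Paimon's schema-change code; `ImmutableOptionSketch`, `validateAlter`, and the `tableHasSnapshots` parameter are hypothetical.

```java
import java.util.Objects;

/** Hypothetical guard: an immutable option may only change while the table has no snapshots. */
final class ImmutableOptionSketch {

    // Assumed option key.
    static final String SNAPSHOT_ORDERING = "sequence.snapshot-ordering";

    static void validateAlter(
            String key, String oldValue, String newValue, boolean tableHasSnapshots) {
        boolean changed = !Objects.equals(oldValue, newValue);
        if (SNAPSHOT_ORDERING.equals(key) && changed && tableHasSnapshots) {
            throw new UnsupportedOperationException(
                    "Change '" + key + "' is not supported on a table with existing snapshots.");
        }
        // Otherwise the change is accepted: other keys, no-op changes, or an empty table.
    }
}
```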

…eBuffer spill

PartialUpdateMergeFunction and AggregateMergeFunction reset snapshotId to
UNKNOWN_SNAPSHOT_ID via reused.replace(...) in getResult(), causing compaction to
stamp -1 into per-record _SEQUENCE_NUMBER and break snapshot-based ordering.
Restore the latest input snapshotId on the merged result.

KeyValueBuffer.createBinaryBuffer also dropped snapshotId during spill round-trip
when snapshot-ordering was enabled; pass options.snapshotSequenceOrdering() to
the serializer so spilled candidates survive deserialization.

Adds unit tests for getResult() snapshotId across deduplicate / first-row /
aggregate / partial-update, plus table-level regression tests covering
partial-update and aggregate compaction and the lookup-merge spill path.
@JunRuiLee
Contributor Author

Thanks @leaves12138 for the careful review.

I fixed the first two correctness issues:

  1. PartialUpdateMergeFunction and AggregateMergeFunction now preserve the winning input record's snapshotId when returning a newly built KeyValue, so compaction no longer stamps UNKNOWN_SNAPSHOT_ID into _SEQUENCE_NUMBER.
  2. KeyValueBuffer now preserves snapshotId when snapshot ordering is enabled, so lookup compaction buffer spill does not lose it during binary serialization/deserialization.

I also added regression coverage for merge-function snapshotId preservation, partial-update compaction, aggregate compaction, and lookup buffer spill.

For the ALTER TABLE concern: this option is annotated as immutable, so enabling it via ALTER on a table with existing snapshots is rejected; empty-table ALTER remains allowed.

Development

Successfully merging this pull request may close these issues.

[Feature] Support snapshot-based sequence ordering for primary-key tables