[core] Support manifest sort feature when commit #7842
/**
 * Compares the value at field {@code k} of two {@link BinaryRow}s according to {@code type}.
 */
static int compareField(BinaryRow a, BinaryRow b, int k, DataType type) {
Why not use CodeGenUtils.newRecordComparator?
    }
}

if (!addedToExisting) {
Don't use the boolean addedToExisting. Just write:
List earliestRun = runs.poll();
if (earliestRun == null) {
    // do something
} else if (compare(xxx) > 0) {
    // do something
} else {
    // do something
}
That makes the control flow cleaner.
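The three-branch shape suggested above can be sketched roughly as follows. This is only an illustration, not the PR's code: intervals are modeled as hypothetical int[]{min, max} pairs instead of partition stats, SortedRunSketch and buildRuns are made-up names, and the strict comparison is an assumption about how touching bounds should be handled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: greedy packing of closed intervals {min, max} into runs.
final class SortedRunSketch {

    static List<List<int[]>> buildRuns(List<int[]> files) {
        // Process files in ascending order of their min bound.
        List<int[]> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingInt((int[] f) -> f[0]));

        // Runs are ordered by the max bound of their last interval,
        // so poll() returns the run that ends earliest (or null if none exist).
        PriorityQueue<List<int[]>> runs = new PriorityQueue<>(
                Comparator.comparingInt((List<int[]> run) -> run.get(run.size() - 1)[1]));

        for (int[] file : sorted) {
            List<int[]> earliestRun = runs.poll();
            if (earliestRun == null) {
                // No run yet: start the first one.
                runs.add(newRun(file));
            } else if (file[0] > earliestRun.get(earliestRun.size() - 1)[1]) {
                // Starts strictly after the run ends: extend it and re-queue it.
                earliestRun.add(file);
                runs.add(earliestRun);
            } else {
                // Overlaps even the earliest-ending run: put it back, open a new run.
                runs.add(earliestRun);
                runs.add(newRun(file));
            }
        }
        return new ArrayList<>(runs);
    }

    private static List<int[]> newRun(int[] file) {
        List<int[]> run = new ArrayList<>();
        run.add(file);
        return run;
    }
}
```

Because the polled run is the one that ends earliest, a file that cannot follow it cannot follow any other run either, so no flag is needed to remember whether the file was placed.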
                last.partitionStats().maxValues(),
                sortFieldIndex,
                sortFieldType)
        >= 0) {
With ">= 0", files in the same run overlap when the two values are equal.
I designed it this way to ensure that the minimum number of Sorted runs is built to reduce the burden of sorting.
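The boundary case in this thread can be made concrete with a small sketch. It assumes the min/max stats describe closed intervals; ClosedIntervalSketch and its method names are hypothetical and only illustrate the difference between ">= 0" and "> 0".

```java
import java.util.Comparator;

// Hypothetical sketch: overlap rules for closed intervals [min, max].
final class ClosedIntervalSketch {

    /** True when [minA, maxA] and [minB, maxB] share at least one value. */
    static <T> boolean overlaps(T minA, T maxA, T minB, T maxB, Comparator<T> cmp) {
        return cmp.compare(minA, maxB) <= 0 && cmp.compare(minB, maxA) <= 0;
    }

    /** A file may follow the last file of a run only if it starts strictly after that file ends. */
    static <T> boolean canAppendToRun(T lastMax, T nextMin, Comparator<T> cmp) {
        return cmp.compare(nextMin, lastMax) > 0;
    }
}
```

With integer keys, [1, 3] and [3, 5] overlap at 3, so canAppendToRun(3, 3, ...) is false under the strict check. A ">= 0" check would accept the pair and yield fewer runs, at the cost of runs whose files touch at a shared key.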
leaves12138 left a comment
Thanks for the update. I took another pass over the latest revision, and I think there are still a few issues that should be fixed before merging.
- The boundary condition for interval overlap still looks wrong.

  In ManifestFileSorter.buildLevelSortedRuns, a file is appended to an existing run when file.min >= last.max. In splitIntoSections, a new section is also started when file.min >= sectionMaxBound.

  However, partition stats represent closed intervals. For example, [1, 3] and [3, 5] still overlap at partition value 3. A sorted run is documented as containing non-overlapping intervals, so this case should not be placed into the same run. Similarly, sections are used as overlap-connected rewrite units, so this case should not be split into different sections either. I think both checks should use > 0, not >= 0, and we should add a test for the max == min boundary case.

- manifest-sort.enabled currently bypasses the original manifest compaction trigger/gate.

  ManifestFileMerger.merge directly enters ManifestFileSorter.trySortRewrite and returns from that path when manifest sort is enabled. This means the original manifest.full-compaction-threshold-size and manifest.merge-min-count behavior is no longer applied in the same way. Inside classifyManifests, files are classified only by fileSize < targetSize or delete-range overlap, so small manifests / delete manifests can trigger sort rewrite more aggressively than the existing merge logic.

  If this is intentional, I think the new semantics should be documented very clearly. Otherwise, the sort path should preserve the existing full/minor compaction gates, especially manifest.merge-min-count for minor manifest merging.

- The partial rewrite path for manifest-sort.max-rewrite-size can break the output order.

  When the first section exceeds the rewrite budget, rewriteSections splits it into rewriteFiles and remainingFiles, rewrites the first part, and appends the remaining section to the end of the sections list. If there are later sections with larger key ranges, the remaining part of the current section will be emitted after them, which can produce an order like 0..10, 20..30, 10..20.

  To keep the manifest list sorted, I think we should either skip the whole section once the budget is exceeded, or keep the remaining section at the current position/order instead of appending it to the tail. This also needs a regression test.

- Test coverage is still missing some important edge cases.

  The new tests cover large overlapping ranges and delete elimination, but I do not see coverage for boundary-touching intervals (max == min), manifest.merge-min-count / full threshold behavior under manifest-sort.enabled, manifest-sort.max-rewrite-size preserving global output order, or null partition values. The switch to RecordComparator is a good improvement, but a null partition test would make this safer.
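The ordering concern in the third point can be sketched as follows. This is a simplified illustration and not the PR's rewriteSections: Manifest, merge, and rewriteInOrder are hypothetical names, key ranges are plain ints, the merged size is modeled as the sum of the input sizes, and sections are assumed to be non-empty and already ordered by key range. It shows one of the two options mentioned above, leaving an over-budget section untouched at its current position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: rewrite sections under a size budget without reordering them.
final class BudgetedRewriteSketch {

    /** Stand-in for a manifest: closed key range plus file size. */
    static final class Manifest {
        final int min;
        final int max;
        final long size;

        Manifest(int min, int max, long size) {
            this.min = min;
            this.max = max;
            this.size = size;
        }
    }

    /** Models a rewrite by collapsing a non-empty section into one manifest. */
    static Manifest merge(List<Manifest> section) {
        int min = section.get(0).min;
        int max = section.get(0).max;
        long size = 0;
        for (Manifest m : section) {
            min = Math.min(min, m.min);
            max = Math.max(max, m.max);
            size += m.size;
        }
        return new Manifest(min, max, size);
    }

    static List<Manifest> rewriteInOrder(List<List<Manifest>> sections, long budget) {
        List<Manifest> output = new ArrayList<>();
        long remaining = budget;
        for (List<Manifest> section : sections) {
            long sectionSize = 0;
            for (Manifest m : section) {
                sectionSize += m.size;
            }
            if (sectionSize <= remaining) {
                // Fits in the budget: emit the rewritten result at this position.
                output.add(merge(section));
                remaining -= sectionSize;
            } else {
                // Over budget: emit the original files here, never at the tail.
                output.addAll(section);
            }
        }
        return output;
    }
}
```

Since every section is emitted where it was encountered, the output stays ordered by key range whether or not a given section was rewritten. A variant that stops rewriting entirely once the budget is first exceeded would preserve order in the same way.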
Thanks for your comment.
        totalDeltaFileSize += file.fileSize();
    }
}
boolean removeAllDelete = totalDeltaFileSize >= sizeTrigger;
Rename to triggerFullCompact
Map<ManifestFileMeta, Boolean> defaultCompactionManifests = new LinkedHashMap<>();
List<ManifestFileMeta> lsmFiles = new LinkedList<>(input);
Set<FileEntry.Identifier> deleteEntries =
        FileEntry.readDeletedEntries(manifestFile, input, manifestReadParallelism);
Why read the deletes every time? If this is not a full compaction, do we still need to read all the deletes?
Split full compaction and minor compaction; don't mix them. Refer to ManifestFileMerger.merge.
Thanks, I will split full compaction and minor compaction.
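One possible shape for that split is sketched below. It rests on simplifying assumptions: the full path is gated by a size trigger, the minor path by a minimum file count, and delete entries are read only on the full path. CompactionGateSketch, Mode, chooseMode, and run are hypothetical names, and plain strings stand in for the delete-entry identifiers.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: decide the compaction mode first, read deletes only when needed.
final class CompactionGateSketch {

    enum Mode { FULL, MINOR, NONE }

    static Mode chooseMode(
            long totalDeltaFileSize, long sizeTrigger, int candidateCount, int mergeMinCount) {
        if (totalDeltaFileSize >= sizeTrigger) {
            return Mode.FULL;
        }
        if (candidateCount >= mergeMinCount) {
            return Mode.MINOR;
        }
        return Mode.NONE;
    }

    /** Returns how many delete entries were consulted; the supplier is only invoked for FULL. */
    static int run(Mode mode, Supplier<Set<String>> readDeletedEntries) {
        switch (mode) {
            case FULL:
                // Only a full compaction pays the cost of reading every delete entry.
                Set<String> deletes = readDeletedEntries.get();
                return deletes.size();
            case MINOR:
            case NONE:
            default:
                return 0;
        }
    }
}
```

Passing the delete read as a Supplier keeps the expensive scan off the minor path, which is the concern raised in the question about reading deletes every time.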