refactor(storage): compact/recluster flow and clustering stats derivation #19754
zhyass merged 11 commits into databendlabs:main
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5b775db004
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
9b0a524 to
386702b
💡 Codex Review
Reviewed commit: 9b0a524ffa
fix test
Codex Review: Didn't find any major issues. 🎉
💡 Codex Review
Reviewed commit: dae4fdf523
Codex Review: Didn't find any major issues. 👍
Nice work. A few concerns came up while reading through the PR, but none are meant as blockers. Feel free to address them here or in follow-up PRs.
Example: a table was originally clustered by
ALTER TABLE t CLUSTER BY (b);
The existing blocks still carry
During recluster,
For example, with current key
S0: B0 [1,2], B1 [9,10]
At segment level,
In this case, assuming the blocks are not selected by the small-block compaction checks, no block rewrite task is generated because the block overlap depth does not exceed the threshold. So recluster falls into the zero-task path:
However,
Commit-side segment generation then sorts by
As a result, recluster can create a new snapshot / rewritten segment metadata without actually improving the segment layout.
With
S0: B0 [1,2], B2 [3,4] => segment range [1,4]
These segment ranges no longer overlap, so segment pruning can benefit from the recluster.
But if commit keeps the original order, the rewritten segments remain:
S0: B0 [1,2], B1 [9,10] => segment range [1,10]
The segment ranges still overlap, so the metadata rewrite does not improve clustering/pruning.
With
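The segment-range arithmetic in this example can be checked with a small Python sketch. This is not Databend code: the helpers `segment_range`, `overlaps`, and `regroup` are invented for illustration, and the second segment is assumed to hold B2 [3,4] plus a hypothetical block B3 [11,12], since the original comment's full layout is not preserved here.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (min, max) of the cluster key within a block


def segment_range(blocks: List[Range]) -> Range:
    """A segment's range spans the smallest min and largest max of its blocks."""
    return (min(lo for lo, _ in blocks), max(hi for _, hi in blocks))


def overlaps(a: Range, b: Range) -> bool:
    """Two inclusive ranges overlap when neither ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]


def regroup(blocks: List[Range], per_segment: int) -> List[List[Range]]:
    """Sort blocks by their range, then cut them into fixed-size segments."""
    ordered = sorted(blocks)
    return [ordered[i:i + per_segment] for i in range(0, len(ordered), per_segment)]


# Original layout. B3 [11,12] is an assumed block, not taken from the review.
s0 = [(1, 2), (9, 10)]   # B0, B1
s1 = [(3, 4), (11, 12)]  # B2, B3 (hypothetical)

original = [segment_range(s0), segment_range(s1)]
sorted_layout = [segment_range(s) for s in regroup(s0 + s1, per_segment=2)]

print(original, overlaps(*original))
print(sorted_layout, overlaps(*sorted_layout))
```

Under these assumptions, keeping the original block order leaves both segment ranges overlapping, while sorting blocks before regrouping yields disjoint ranges, which is the difference the comment is pointing at.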
For a Hilbert-clustered table, normal inserts do not generate block/segment
After the user compacts the table,
In
v.cluster_key_id != self.default_cluster_id
Since the backfilled stats use the current cluster key id, this condition is false, so the compacted segment is not selected for Hilbert recluster.
This looks like a behavior regression: before this PR, the compacted segment would still have had
Additional note: Two places implicitly depend on ratios defined inside
The comment in
Consider exposing these ratios through
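A rough Python sketch of the Hilbert concern above, for readers who want to see why the backfill changes selection. Only the comparison `cluster_key_id != default_cluster_id` comes from the comment; the `ClusterStats` class, the function name, and the rule that a segment without stats is selected are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterStats:
    cluster_key_id: int


def selected_for_hilbert_recluster(
    stats: Optional[ClusterStats], default_cluster_id: int
) -> bool:
    # Assumption: a segment with no cluster stats counts as not yet reclustered.
    if stats is None:
        return True
    # This comparison mirrors the condition quoted in the review comment.
    return stats.cluster_key_id != default_cluster_id


current_key_id = 3  # hypothetical id of the current cluster key

# Without backfill, the compacted segment carries no cluster stats.
without_backfill = selected_for_hilbert_recluster(None, current_key_id)

# With backfill, the stats are stamped with the current cluster key id.
with_backfill = selected_for_hilbert_recluster(
    ClusterStats(cluster_key_id=current_key_id), current_key_id
)

print(without_backfill, with_backfill)
```

If the selector works roughly like this, a segment whose stats were backfilled with the current key id looks already reclustered, even though its rows were never ordered along the Hilbert curve.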
💡 Codex Review
Reviewed commit: 5d9fe6f338
fix review comments
fix test
fix test
fix test
Codex Review: Didn't find any major issues. Keep them coming!
@codex review
Codex Review: Didn't find any major issues. Chef's kiss.
fe693bb
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
Refactor the responsibility boundary between compact and recluster for FUSE tables. Compact no longer performs reordering, clustering-metadata repair, or forced rewrites caused by schema drift; recluster is responsible only for reclustering by the current cluster key, and schema drift no longer triggers old-block rewrites.
Refine recluster semantics so it no longer piggybacks on compaction behavior. Blocks that have not been reclustered under the current cluster key are now also eligible as recluster candidates, and can be locally sorted before entering merge-sort when needed.
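As a rough illustration of "locally sorted before entering merge-sort", here is a minimal Python sketch. It is not the Databend implementation: the `Block` class and both functions are hypothetical, and it assumes that a block already clustered under the current key is internally sorted.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    rows: List[int]                # cluster-key values, one per row
    cluster_key_id: Optional[int]  # key the block was last clustered under, if any


def is_clustered_under(block: Block, current_key_id: int) -> bool:
    return block.cluster_key_id == current_key_id


def recluster(blocks: List[Block], current_key_id: int) -> List[int]:
    runs = []
    for block in blocks:
        if is_clustered_under(block, current_key_id):
            # Assumed to be sorted by the current key already.
            runs.append(block.rows)
        else:
            # Never reclustered under the current key: sort locally first.
            runs.append(sorted(block.rows))
    # k-way merge of the sorted runs.
    return list(heapq.merge(*runs))


blocks = [
    Block(rows=[1, 5, 9], cluster_key_id=2),     # clustered under the current key
    Block(rows=[8, 2, 6], cluster_key_id=1),     # clustered under an older key
    Block(rows=[7, 3], cluster_key_id=None),     # never clustered
]
merged = recluster(blocks, current_key_id=2)
print(merged)
```

The local sort is what makes the k-way merge valid for blocks written under an older key or with no key at all, since `heapq.merge` requires every input run to be sorted.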
Introduce BlockMetaOptions in the read path to consolidate block metadata flags such as block-index reservation, stream-column updates, and internal-column queries.
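One way to picture consolidating these flags is a single options value passed along the read path instead of several independent booleans. The sketch below is Python, not the actual Rust type; the real fields of BlockMetaOptions are not shown in this description, so every member name and the helper function here are hypothetical.

```python
from enum import Flag, auto


class BlockMetaOptions(Flag):
    # Member names are illustrative guesses based on the description above.
    NONE = 0
    RESERVE_BLOCK_INDEX = auto()
    UPDATE_STREAM_COLUMNS = auto()
    QUERY_INTERNAL_COLUMNS = auto()


def needs_block_index(opts: BlockMetaOptions) -> bool:
    """Hypothetical reader-side check for one of the consolidated flags."""
    return BlockMetaOptions.RESERVE_BLOCK_INDEX in opts


opts = BlockMetaOptions.RESERVE_BLOCK_INDEX | BlockMetaOptions.QUERY_INTERNAL_COLUMNS
print(needs_block_index(opts), BlockMetaOptions.UPDATE_STREAM_COLUMNS in opts)
```

Grouping the flags this way means a new metadata flag can be added in one place without changing the signature of every function on the read path.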
Update recluster candidate selection, task generation, and statistics derivation. The new flow reuses existing cluster_stats when available and otherwise derives min/max from column statistics; those derived values are also reused by clustering_information and by recluster decisions on older blocks and segments.
Add segment-summary cluster_stats backfill during mutation writes: when block-level cluster_stats are missing, the new segment summary derives min/max from its own merged col_stats and the current cluster key instead of leaving segment-level clustering stats empty.
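The derivation and backfill described above can be sketched in Python as follows. This is a simplified model rather than the real code: all names are hypothetical, the cluster key is assumed to consist of plain columns (not expressions), and existing stats are assumed to be reusable only when they were written under the current cluster key id.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

ColStats = Dict[str, Tuple[int, int]]  # column name -> (min, max)


@dataclass
class ClusterStats:
    cluster_key_id: int
    min_values: Tuple[int, ...]
    max_values: Tuple[int, ...]


def derive_cluster_stats(
    existing: Optional[ClusterStats],
    col_stats: ColStats,
    key_id: int,
    key_columns: List[str],
) -> Optional[ClusterStats]:
    # Reuse stats already written under the current key (assumed rule).
    if existing is not None and existing.cluster_key_id == key_id:
        return existing
    # Cannot derive anything if a key column has no column statistics.
    if any(col not in col_stats for col in key_columns):
        return None
    return ClusterStats(
        cluster_key_id=key_id,
        min_values=tuple(col_stats[c][0] for c in key_columns),
        max_values=tuple(col_stats[c][1] for c in key_columns),
    )


def merge_col_stats(blocks: List[ColStats]) -> ColStats:
    """Combine per-block column min/max into segment-level column stats."""
    merged: ColStats = {}
    for stats in blocks:
        for col, (lo, hi) in stats.items():
            if col in merged:
                merged[col] = (min(merged[col][0], lo), max(merged[col][1], hi))
            else:
                merged[col] = (lo, hi)
    return merged


# Segment-level backfill: no block carries cluster stats, so derive them
# from the merged column stats and the current cluster key (b).
block_col_stats = [{"b": (1, 2)}, {"b": (9, 10)}]
segment_stats = derive_cluster_stats(
    None, merge_col_stats(block_col_stats), key_id=1, key_columns=["b"]
)
print(segment_stats)
```

Note that stats derived this way describe only the value range of the key columns; they do not by themselves show that rows inside the block or segment are ordered by the cluster key.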
Simplify the compact path. Compact no longer checks column-id drift, and no longer materializes default values for newly added columns during compaction; after ADD COLUMN, compact only performs physical compaction.
Tests
Type of change