[parquet] Add map shredding for hot keys #7877
Open
Aitozi wants to merge 1 commit into
Contributor
Author
Benchmark command:
mvn -s ~/.m2/apache-community.xml -pl paimon-format -am -Pfast-build \
    -DfailIfNoTests=false -Dtest=MapShreddingStorageBenchmark test
Benchmark file: [MapShreddingStorageBenchmark.java]
Common Setup
Results
Scenario Details
Conclusion: in this synthetic storage benchmark, map shredding reduces file size in both cases. The biggest gain appears when hot map keys are long and repeated across many rows, saving about |
Contributor
This looks very suitable to be solved using Variant. Why not use it?
Purpose
Add Parquet map shredding support for MAP<STRING, T> columns.
This allows selected map columns to extract hot keys into independent physical Parquet columns while preserving the original logical map schema for readers. The feature is controlled by map.shredding.* options, aligned with the existing variant.shredding.* naming style. It also adds a focused round-trip test and a storage benchmark to validate the storage benefit.
Tests
mvn -pl paimon-api,paimon-format -Pfast-build -DskipTests compile
mvn -pl paimon-format -am -Pfast-build -DfailIfNoTests=false -Dtest=ParquetFormatReadWriteTest#testMapShreddingRoundTrip,MapShreddingStorageBenchmark test
git diff --check
Physical Layout
This change does not introduce a new Parquet logical type and does not modify the standard Parquet MAP encoding. A shredded map is still written with the regular Parquet map group as the residual map. Hot keys are promoted into additional sibling sidecar columns in the parent Parquet group.
For example, a logical field:
is normally written as:
With map shredding enabled, if user-agent and host are selected as hot keys, the physical Parquet schema becomes:
The footer metadata records the mapping from sidecar columns to map keys:
During writing, entries for promoted hot keys are omitted from the residual map when their values are non-null, and their values are written into the corresponding sidecar columns. During reading, Paimon reads both the residual map and the sidecar columns, then reconstructs the original logical MAP<STRING, T> value.
For nested maps, the same rule applies within the containing row group. For example, for payload.headers, sidecar columns are added as siblings of the headers map inside the payload group, and the footer metadata uses the full logical path:
#7876
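The write and read rules described above can be sketched in plain Java for a single row. This is only an illustration under stated assumptions, not the code in this PR: the class and method names (MapShreddingSketch, shred, reconstruct) are hypothetical, values are fixed to String in place of T, and a null sidecar slot is assumed to mean "no promoted value for this row". A hot key whose value is null stays in the residual map, which follows from the rule that only non-null values are moved out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not Paimon's writer or reader classes.
final class MapShreddingSketch {

    /** Physical form of one row: the residual map plus one sidecar slot per hot key. */
    static final class Shredded {
        final Map<String, String> residual = new LinkedHashMap<>();
        final String[] sidecars; // sidecars[i] holds the value for hotKeys.get(i), or null

        Shredded(int hotKeyCount) {
            this.sidecars = new String[hotKeyCount];
        }
    }

    /** Write side: move non-null hot-key values into sidecars, keep the rest in the residual map. */
    static Shredded shred(Map<String, String> logical, List<String> hotKeys) {
        Shredded out = new Shredded(hotKeys.size());
        for (Map.Entry<String, String> e : logical.entrySet()) {
            int idx = hotKeys.indexOf(e.getKey());
            if (idx >= 0 && e.getValue() != null) {
                out.sidecars[idx] = e.getValue();
            } else {
                // Non-hot keys, and hot keys mapped to null, stay in the residual map.
                out.residual.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    /** Read side: merge the residual map and the non-null sidecar values into one logical map. */
    static Map<String, String> reconstruct(Shredded s, List<String> hotKeys) {
        Map<String, String> logical = new LinkedHashMap<>(s.residual);
        for (int i = 0; i < hotKeys.size(); i++) {
            if (s.sidecars[i] != null) {
                logical.put(hotKeys.get(i), s.sidecars[i]);
            }
        }
        return logical;
    }

    public static void main(String[] args) {
        List<String> hotKeys = Arrays.asList("user-agent", "host");
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("user-agent", "curl");
        headers.put("host", null);
        headers.put("x-trace", "abc");

        Shredded s = shred(headers, hotKeys);
        System.out.println("residual keys: " + s.residual.keySet());
        System.out.println("round trip equal: " + reconstruct(s, hotKeys).equals(headers));
    }
}
```

The sketch restores the same key-value pairs but not the original entry order. Whether the real reader preserves entry order is not stated in this description.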