[parquet] Add map shredding for hot keys by Aitozi · Pull Request #7877 · apache/paimon

Aitozi · 2026-05-17T04:05:01Z

Purpose

Add Parquet map shredding support for MAP<STRING, T> columns.

This allows selected map columns to extract hot keys into independent physical Parquet columns while preserving the original logical map schema for readers. The feature is controlled by map.shredding.* options, aligned with the existing variant.shredding.* naming style. It also adds a focused round-trip test and a storage benchmark to validate the storage benefit.

Tests

mvn -pl paimon-api,paimon-format -Pfast-build -DskipTests compile
mvn -pl paimon-format -am -Pfast-build -DfailIfNoTests=false -Dtest=ParquetFormatReadWriteTest#testMapShreddingRoundTrip,MapShreddingStorageBenchmark test
git diff --check

Physical Layout

This change does not introduce a new Parquet logical type and does not modify the standard Parquet MAP encoding. A shredded map is still written with the regular Parquet map group as the residual map. Hot keys are promoted into additional sibling sidecar columns in the parent Parquet group.

For example, a logical field:

headers MAP<STRING, STRING>

is normally written as:

message paimon_schema {
  optional group headers (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }
}

With map shredding enabled, if user-agent and host are selected as hot keys, the physical Parquet schema becomes:

message paimon_schema {
  optional group headers (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }

  optional binary dynamic_column_headers_value_0 (STRING);
  optional binary dynamic_column_headers_value_1 (STRING);
}

The footer metadata records the mapping from sidecar columns to map keys:

parquet.meta.dynamic.column.map.keys.of.headers = user-agent,host

During writing, entries for promoted hot keys are omitted from the residual map when their values are non-null, and their values are written into the corresponding sidecar columns. During reading, Paimon reads both the residual map and the sidecar columns, then reconstructs the original logical MAP<STRING, T> value.
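The write and read rules above can be modelled on plain Java maps. This is a minimal sketch, not the Paimon writer or reader: the class `MapShreddingSketch`, its nested `ShreddedRow` type, and both method names are hypothetical, values are fixed to `String`, and the logical map is assumed to be non-null.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model of map shredding on in-memory maps (hypothetical names). */
final class MapShreddingSketch {

    /** One shredded row: the residual map plus one slot per hot key. */
    static final class ShreddedRow {
        final Map<String, String> residual;
        final String[] sidecars; // sidecars[i] belongs to hotKeys.get(i); null = nothing promoted

        ShreddedRow(Map<String, String> residual, String[] sidecars) {
            this.residual = residual;
            this.sidecars = sidecars;
        }
    }

    /** Write side: move non-null hot-key values into sidecar slots, keep everything else. */
    static ShreddedRow shred(Map<String, String> logical, List<String> hotKeys) {
        Map<String, String> residual = new LinkedHashMap<>(logical);
        String[] sidecars = new String[hotKeys.size()];
        for (int i = 0; i < hotKeys.size(); i++) {
            String value = logical.get(hotKeys.get(i));
            if (value != null) {
                sidecars[i] = value;
                residual.remove(hotKeys.get(i));
            }
            // A hot key whose value is null stays in the residual map, so a null
            // sidecar slot only ever means "nothing was promoted for this row".
        }
        return new ShreddedRow(residual, sidecars);
    }

    /** Read side: merge non-null sidecar values back into the residual entries. */
    static Map<String, String> reconstruct(ShreddedRow row, List<String> hotKeys) {
        Map<String, String> logical = new LinkedHashMap<>(row.residual);
        for (int i = 0; i < hotKeys.size(); i++) {
            if (row.sidecars[i] != null) {
                logical.put(hotKeys.get(i), row.sidecars[i]);
            }
        }
        return logical;
    }
}
```

In this sketch, `reconstruct(shred(m, hotKeys), hotKeys)` is equal to `m` under `Map.equals`, although the iteration order of entries may differ from the input.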

For nested maps, the same rule applies within the Parquet group of the containing row. For example, for payload.headers, sidecar columns are added as siblings of the headers map inside the payload group, and the footer metadata uses the full logical path:

parquet.meta.dynamic.column.map.keys.of.payload.headers = user-agent,host
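The naming pattern in the examples above can be written down as a small helper. This is a sketch under assumptions, not code from the PR: the class and method names are hypothetical, the footer value is treated as a plain comma-separated list with no escaping, the i-th key in that list is assumed to map to the sidecar column with suffix `_value_<i>`, and the sidecar name is assumed to use only the map's field name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that mirrors the naming shown in the examples. */
final class SidecarNamingSketch {

    // Prefix copied from the footer entries in the examples.
    static final String FOOTER_KEY_PREFIX = "parquet.meta.dynamic.column.map.keys.of.";

    /** Footer metadata key for a map at the given logical path, e.g. "payload.headers". */
    static String footerKey(String logicalPath) {
        return FOOTER_KEY_PREFIX + logicalPath;
    }

    /**
     * Maps each hot key to its sidecar column name.
     * Assumption: positional mapping, and keys never contain a comma.
     */
    static Map<String, String> keyToColumn(String mapFieldName, String footerValue) {
        Map<String, String> mapping = new LinkedHashMap<>();
        String[] keys = footerValue.split(",");
        for (int i = 0; i < keys.length; i++) {
            mapping.put(keys[i], "dynamic_column_" + mapFieldName + "_value_" + i);
        }
        return mapping;
    }
}
```

For instance, `keyToColumn("headers", "user-agent,host")` maps `user-agent` to `dynamic_column_headers_value_0` and `host` to `dynamic_column_headers_value_1`, matching the schema example. How the real implementation handles keys that contain a comma is not stated in this PR description.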

#7876

Aitozi · 2026-05-17T05:04:58Z

Benchmark command:

mvn -s ~/.m2/apache-community.xml -pl paimon-format -am -Pfast-build \
  -DfailIfNoTests=false -Dtest=MapShreddingStorageBenchmark test

Benchmark file: [MapShreddingStorageBenchmark.java]

Common Setup

Schema: id INT, headers MAP<STRING, STRING>
Rows: 100,000
Hot keys: 32
Value length: 16
Compression: snappy
Compared layouts:
- regular: normal Parquet map encoding
- mapShredding: promotes 32 hot keys from headers into sidecar columns
Map shredding options:
- map.shredding.columns=headers
- map.shredding.maxKeys=32
- map.shredding.maxInferBufferRow=10000
- map.shredding.maxInferBufferMemory=64 mb
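
The option names suggest that hot keys are inferred from a bounded buffer of rows before the physical schema is fixed. The PR description does not spell out the selection rule, so the following is only one plausible reading: count how many buffered rows contain each key and keep the `maxKeys` most frequent ones. The class name, method name, and tie-breaking rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical frequency-based hot-key selection over a buffer of rows. */
final class HotKeyInferenceSketch {

    static List<String> inferHotKeys(List<Map<String, String>> bufferedRows, int maxKeys) {
        // Count the number of buffered rows in which each key appears.
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> row : bufferedRows) {
            if (row == null) {
                continue; // a null map contributes no keys
            }
            for (String key : row.keySet()) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        // Most frequent first; ties broken by key name so the result is deterministic.
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(keys.subList(0, Math.min(maxKeys, keys.size())));
    }
}
```

In this reading, `map.shredding.maxInferBufferRow` and `map.shredding.maxInferBufferMemory` would cap the size of `bufferedRows`, and `map.shredding.maxKeys` would be passed as `maxKeys`.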

Results

| Scenario | Regular | Map Shredding | Saved | Saving |
| --- | --- | --- | --- | --- |
| Columnar value storage | 708,012 bytes | 431,637 bytes | 276,375 bytes | 39.04% |
| Long hot key storage | 40,845,943 bytes | 16,365,106 bytes | 24,480,837 bytes | 59.93% |
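
The Saved and Saving columns follow directly from the two file sizes: saved = regular - shredded, and saving = saved / regular. A small sketch of that arithmetic, with a hypothetical class name, for anyone re-running the benchmark with different parameters:

```java
import java.util.Locale;

/** Hypothetical helper that recomputes the derived columns of the results table. */
final class StorageSavingSketch {

    /** Bytes saved by the shredded layout relative to the regular layout. */
    static long savedBytes(long regularBytes, long shreddedBytes) {
        return regularBytes - shreddedBytes;
    }

    /** Saving as a percentage of the regular file size, rounded to two decimals. */
    static String savingPercent(long regularBytes, long shreddedBytes) {
        double percent = 100.0 * (regularBytes - shreddedBytes) / regularBytes;
        return String.format(Locale.ROOT, "%.2f%%", percent);
    }
}
```

With the reported sizes, `savingPercent(708012L, 431637L)` gives `39.04%` and `savingPercent(40845943L, 16365106L)` gives `59.93%`.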

Scenario Details

Columnar value storage: key names are short, values follow a repeated pattern with valueRunLength=128 and valueCardinality=4, dictionary encoding enabled. This measures whether promoted hot-key values benefit from columnar and dictionary encoding.
Long hot key storage: hot key names include 128 bytes of padding, dictionary encoding disabled. This measures the benefit of avoiding repeated long map-key strings in every row.

Conclusion: in this synthetic storage benchmark, map shredding reduces file size in both cases. The biggest gain appears when hot map keys are long and repeated across many rows, saving about 59.93%.

JingsongLi · 2026-05-17T13:42:31Z

This looks like a good fit for Variant. Why not use it?

Aitozi force-pushed the mwj-map-shredding branch from 61967d4 to 5a5b5a5 (May 17, 2026 04:36)

[parquet] Add map shredding for hot keys

5f397f8

Aitozi force-pushed the mwj-map-shredding branch from 5a5b5a5 to 5f397f8 (May 17, 2026 04:38)

Aitozi wants to merge 1 commit into apache:master from Aitozi:mwj-map-shredding