feat: add dictionary_columns parameter to Table.scan() for memory-efficient reads by GayathriSrividya · Pull Request #3461 · apache/iceberg-python

GayathriSrividya · 2026-06-05T07:11:48Z

Rationale

Columns that contain large, frequently repeated string values (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet dataset reader (pyarrow.dataset.ParquetFileFormat) natively supports dictionary-encoded reads via its dictionary_columns kwarg. A dictionary-encoded column stores each distinct value once plus an integer index per row, which can substantially reduce peak memory usage when values repeat.
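
To make the memory argument concrete, here is a minimal pure-Python sketch of what dictionary encoding does. It does not use PyArrow, and dictionary_encode / dictionary_decode are hypothetical helpers written for illustration only, not part of this PR.

```python
def dictionary_encode(values):
    """Split a list of values into (distinct values, one integer index per row)."""
    dictionary = []   # each distinct value is stored exactly once
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per input row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original list from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Illustrative data: 3000 rows but only two distinct payloads.
payloads = ['{"status": "ok"}', '{"status": "error"}', '{"status": "ok"}'] * 1000
dictionary, indices = dictionary_encode(payloads)

# Characters held as string data before and after encoding.
plain_chars = sum(len(v) for v in payloads)
encoded_chars = sum(len(v) for v in dictionary)
```

The saving depends entirely on repetition: a column whose values are all distinct gains nothing from this layout and pays for the extra index array.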

This was previously discussed in #3168 and a prior implementation (#3234) was closed as stale.

Changes

Added dictionary_columns: tuple[str, ...] = () to Table.scan(), TableScan.__init__, and StagedTable.scan().
Forwarded through DataScan.to_arrow() and to_arrow_batch_reader() → ArrowScan.__init__ → _task_to_record_batches → _get_file_format().
Only applied when task.file.file_format == FileFormat.PARQUET; silently ignored for ORC (which does not support this kwarg).
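
The format gate described above can be sketched roughly as follows. This is not the code from the diff: FileFormat here is a stand-in enum, reader_kwargs is a hypothetical helper, and skipping the kwarg for an empty tuple is an assumption rather than something the PR description states.

```python
from enum import Enum


class FileFormat(Enum):
    """Stand-in for pyiceberg's FileFormat, defined here so the sketch runs alone."""
    PARQUET = "PARQUET"
    ORC = "ORC"


def reader_kwargs(file_format, dictionary_columns=()):
    """Build the kwargs that would be handed to the file-format factory.

    Hypothetical helper: dictionary_columns is only forwarded for Parquet.
    """
    kwargs = {}
    # Assumption: an empty tuple means "no dictionary columns requested".
    if file_format == FileFormat.PARQUET and dictionary_columns:
        kwargs["dictionary_columns"] = dictionary_columns
    return kwargs


parquet_kwargs = reader_kwargs(FileFormat.PARQUET, ("payload",))
orc_kwargs = reader_kwargs(FileFormat.ORC, ("payload",))
```

Dropping the option for ORC keeps mixed-format tables readable with a single scan call, at the cost that a caller gets no signal when the option has no effect.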

Usage

# Read the "payload" column as dictionary-encoded to save memory
arrow_table = table.scan(dictionary_columns=("payload",)).to_arrow()

Verification

Added test_dictionary_columns_produces_dict_encoded_output — confirms the requested column is dict-encoded, non-requested columns are plain, and values are identical.
make lint ✓
pytest tests/table/ tests/io/test_pyarrow.py ✓

…icient reads

Columns that contain large or frequently repeated strings (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet reader supports reading such columns as dictionary-encoded arrays, which deduplicates values and can dramatically reduce memory usage.

Add a dictionary_columns: tuple[str, ...] parameter to Table.scan() (and the underlying TableScan / ArrowScan classes) that is forwarded to _get_file_format() as PyArrow's dictionary_columns kwarg. Only applies to Parquet files; silently ignored for ORC.

Usage: table.scan(dictionary_columns=("payload",)).to_arrow()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GayathriSrividya wants to merge 1 commit into apache:main from GayathriSrividya:feat/issue-3170-dictionary-columns-scan
