Skip to content

feat: add dictionary_columns parameter to Table.scan() for memory-efficient reads#3461

Open
GayathriSrividya wants to merge 1 commit into
apache:mainfrom
GayathriSrividya:feat/issue-3170-dictionary-columns-scan
Open

feat: add dictionary_columns parameter to Table.scan() for memory-efficient reads#3461
GayathriSrividya wants to merge 1 commit into
apache:mainfrom
GayathriSrividya:feat/issue-3170-dictionary-columns-scan

Conversation

@GayathriSrividya
Copy link
Copy Markdown

Closes #3170

Rationale

Columns that contain large or frequently repeated string values (e.g. JSON blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them as plain string arrays. PyArrow's Parquet reader natively supports dictionary-encoded reads via its dictionary_columns kwarg, which deduplicates values and can dramatically reduce peak memory usage.

This was previously discussed in #3168 and a prior implementation (#3234) was closed as stale.

Changes

  • Added dictionary_columns: tuple[str, ...] = () to Table.scan(), TableScan.__init__, and StagedTable.scan().
  • Forwarded through DataScan.to_arrow() and to_arrow_batch_reader()ArrowScan.__init___task_to_record_batches_get_file_format().
  • Only applied when task.file.file_format == FileFormat.PARQUET; silently ignored for ORC (which does not support this kwarg).

Usage

# Read the "payload" column as dictionary-encoded to save memory
df = table.scan(dictionary_columns=("payload",)).to_arrow()

Verification

  • Added test_dictionary_columns_produces_dict_encoded_output — confirms the requested column is dict-encoded, non-requested columns are plain, and values are identical.
  • make lint
  • pytest tests/table/ tests/io/test_pyarrow.py

…icient reads

Columns that contain large or frequently repeated strings (e.g. JSON
blobs, low-cardinality categoricals) can exhaust memory when PyArrow
loads them as plain string arrays.  PyArrow's Parquet reader supports
reading such columns as dictionary-encoded arrays, which deduplicates
values and can dramatically reduce memory usage.

Add a dictionary_columns: tuple[str, ...] parameter to Table.scan()
(and the underlying TableScan / ArrowScan classes) that is forwarded
to _get_file_format() as PyArrow's dictionary_columns kwarg.  Only
applies to Parquet files; silently ignored for ORC.

Usage:
    table.scan(dictionary_columns=("payload",)).to_arrow()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feature request: pass optional parameters to DataScan/pyarrow

1 participant