[python] Add parallel split reading to to_pandas / to_arrow #7870
Open
TheR1sing3un wants to merge 2 commits into
Conversation
Force-pushed from 9fe7ec1 to b762e0a
JingsongLi
reviewed
May 16, 2026
def to_arrow(
    self,
    splits: List[Split],
    max_workers: Optional[int] = None,
Contributor
Maybe it is better to provide a table option `read.parallelism`?
Member
Author
> Maybe it is better to provide a table option `read.parallelism`?
A very good suggestion. I have added this option.
By the way, since `read.parallelism` is a table option, do read APIs like to_pandas/to_arrow still need a parameter that overrides the table-level parallelism per call? I currently provide such a parameter; if you think it is unnecessary, I can remove it.
Today TableRead.to_pandas / to_arrow iterate splits serially in
_arrow_batch_generator, so wall time scales linearly with the number
of splits even though PyArrow's parquet/orc readers release the GIL
during decode. Unlike Java, where Flink/Spark fan splits out across
TaskManagers/Executors, PyPaimon has no external framework above the
SDK; split-level parallelism therefore has to live inside the SDK.
This commit adds an opt-in dual-track API for split-level parallelism:
1. A new table option `read.parallelism` (default 1) sets the
persistent default for a table:
options={'read.parallelism': '4'}.
2. A new method argument `parallelism` on to_pandas / to_arrow
temporarily overrides the option for a single call:
read.to_pandas(splits, parallelism=8).
Priority: method argument > table option > built-in default of 1.
This covers both "configure once, all reads benefit" (option) and
ad-hoc tuning without altering the table schema (argument).
Behavior:
- effective == 1 (default or explicit) keeps the serial path
unchanged; no thread pool is created.
- effective >= 2 with at least 2 splits runs splits through a
ThreadPoolExecutor and assembles the final Table in the input
splits' order (results collected by submission index).
- effective < 1 (from either source) raises ValueError naming
whichever source produced the value.
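The behavior above can be illustrated with a minimal sketch of the dispatch between the serial and parallel paths. `read_split` is a stand-in for the real per-split reader logic, not PyPaimon code; the point shown is that results are collected by submission index, so output order matches input order regardless of which worker finishes first.

```python
# Illustrative sketch, not the actual implementation: splits run in a
# ThreadPoolExecutor and the results are reassembled in the input
# splits' order by collecting futures in submission order.
from concurrent.futures import ThreadPoolExecutor


def read_split(split):
    # Placeholder for the real work: create a reader, decode batches.
    return [split * 10, split * 10 + 1]


def read_parallel(splits, parallelism):
    if parallelism <= 1 or len(splits) < 2:
        # Serial path unchanged; no thread pool is created.
        return [read_split(s) for s in splits]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(read_split, s) for s in splits]
        # Iterating futures in submission order preserves input order.
        return [f.result() for f in futures]
```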
Limit pushdown stays correct under parallelism via _RemainingRows, a
thread-safe row-quota counter. Quota is pre-debited under a single
lock so the combined output never exceeds self.limit, even if
individual readers decode one extra batch after the quota is gone -
that batch is simply dropped instead of being emitted.
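A counter in the spirit of `_RemainingRows` might look like the following sketch. The class and method names here are assumptions for illustration; the key property, matching the description above, is that quota is pre-debited under a single lock, so the sum of all grants can never exceed the limit even under thread contention.

```python
# Hedged sketch of a thread-safe row-quota counter. A worker asks for
# `requested` rows before emitting a batch; a grant smaller than the
# request means the batch must be truncated, and a grant of 0 means
# the batch is dropped entirely.
import threading
from typing import Optional


class RemainingRows:
    def __init__(self, limit: Optional[int]):
        self._remaining = limit  # None means no limit was pushed down
        self._lock = threading.Lock()

    def take(self, requested: int) -> int:
        """Pre-debit up to `requested` rows; returns the granted count."""
        if self._remaining is None:
            return requested
        with self._lock:
            granted = min(requested, self._remaining)
            self._remaining -= granted
            return granted
```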
Reader resource handling matches the serial path: each worker uses
try/finally to close its reader, and ThreadPoolExecutor's wait-on-
exit guarantees every started reader is closed before the call
returns, even when one worker raises.
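The cleanup guarantee described above can be demonstrated with a self-contained sketch. `FakeReader` is an invented stand-in used only to show the pattern: the `finally` clause closes each worker's reader even when decoding raises, and the executor's wait-on-exit means every started worker has finished (and therefore closed its reader) before the exception reaches the caller.

```python
# Illustrative sketch of the per-worker try/finally pattern plus
# ThreadPoolExecutor's wait-on-exit. FakeReader is hypothetical.
from concurrent.futures import ThreadPoolExecutor

closed = []  # records which readers were closed, for demonstration


class FakeReader:
    def __init__(self, split):
        self.split = split

    def read(self):
        if self.split == "bad":
            raise RuntimeError("decode failed")
        return self.split

    def close(self):
        closed.append(self.split)


def worker(split):
    reader = FakeReader(split)
    try:
        return reader.read()
    finally:
        reader.close()  # runs even when read() raises


def read_all(splits):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(worker, s) for s in splits]
        # The first failure propagates here; leaving the `with` block
        # still waits for every started worker to finish.
        return [f.result() for f in futures]
```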
Other to_* methods (to_arrow_batch_reader, to_iterator, to_duckdb,
to_ray, to_torch) are deliberately not touched in this round - their
order-preserving / streaming semantics deserve a separate look.
Tests cover:
- _RemainingRows correctness under unbounded, bounded, zero-request,
and 8-thread contention scenarios.
- Append-only multi-partition: parallel via method argument matches
serial byte-for-byte; parallel via table option also matches.
- Priority matrix: method argument overrides option (both directions),
option overrides default, explicit 1 keeps serial path.
- PK merge-on-read multi-bucket: parallel + serial produce the same
merged rows.
- Limit + parallel: 10 repeated runs return exactly the configured
row count.
- Edge cases: empty splits with parallelism=4, parallelism exceeding
split count, invalid method argument and invalid option value each
raise ValueError with a source-specific message.
- Reader error propagation: when one split's create_reader raises,
the exception surfaces from to_pandas and sibling readers are
cleaned up.
Force-pushed from b762e0a to 3cf7e53
Purpose
Today `TableRead.to_pandas` / `to_arrow` iterate splits serially in `_arrow_batch_generator`, so wall time scales linearly with the number of splits even though PyArrow's parquet/orc readers release the GIL during decode. Unlike Java, where Flink/Spark fan splits out across TaskManagers/Executors, PyPaimon has no external framework above the SDK; split-level parallelism therefore has to live inside the SDK.
This PR adds an opt-in `max_workers` parameter to `to_pandas` / `to_arrow`. Default behavior is unchanged.
Linked issue
N/A — direct contribution.
API
- `max_workers=None` (default) or `1` → original serial path, no thread pool created.
- `>= 2` with at least 2 splits → `ThreadPoolExecutor` runs splits concurrently; the final `Table` is assembled in the input splits' order (results collected by submission index).
- `< 1` → `ValueError`.
Other `to_*` methods (`to_arrow_batch_reader`, `to_iterator`, `to_duckdb`, `to_ray`, `to_torch`) are intentionally untouched - their order-preserving / streaming semantics deserve a separate look.
Correctness under limit
`_RemainingRows` is a thread-safe row-quota counter shared by all workers. Quota is pre-debited under a single lock so the combined output never exceeds `self.limit`, even if individual readers decode one extra batch after the quota is gone (the surplus batch is simply dropped, never emitted).
Resource handling
Each worker uses `try/finally: reader.close()`. `ThreadPoolExecutor`'s wait-on-exit guarantees every started reader is closed before `to_arrow` returns, even when one worker raises and propagates its exception.
Tests
Added `paimon-python/pypaimon/tests/reader_parallel_test.py` (16 tests):
- `_RemainingRows`: unbounded, bounded pre-debit, zero-request, 8-thread contention.
- `limit` + parallel: 10 repeated runs return exactly the configured row count.
- `max_workers=4`, `max_workers` exceeding split count, `max_workers=0`/`-1` rejected, `max_workers=1` matches serial, `include_row_kind=True` parity.
- When one split's `create_reader` raises, the exception surfaces from `to_pandas` and sibling readers are cleaned up.
API / format impact
`CoreOptions`: no new option introduced in this round.
Documentation impact
Docstrings on `to_arrow` / `to_pandas` updated. Design doc added at `paimon-python/docs/design/2026-05-15-pypaimon-parallel-to-pandas.md`. README untouched.
Generative AI disclosure
Yes - the implementation, tests, and design doc were drafted with Claude Code assistance under my direction and review.