[core] Optimize Flink BTree index topology#7852

Merged
JingsongLi merged 1 commit into apache:master from leaves12138:codex/flink-btree-single-topology on May 23, 2026

@leaves12138
Contributor

What changed

  • Reworked Flink BTree global index building to use one task-driven topology for all contiguous row ranges instead of building one topology per range.
  • Added an internal build task id to the sort key so each range keeps its own row-range metadata while sharing the same Flink source/read/sort/write chain.
  • Added coverage for parallelism calculation, many small ranges, and a single large range split across multiple writer subtasks.
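
To illustrate the second bullet, here is a minimal sketch of how a task id prefix in the sort key lets many ranges share one sort while keeping each range's rows contiguous. The `KeyedRow` type, `TASK_THEN_KEY` comparator, and `sortForWriters` helper are hypothetical names invented for this example; they are not the classes used in the PR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TaskKeySortSketch {

    /** Hypothetical row: a synthetic build task id followed by the indexed value. */
    public static final class KeyedRow {
        final int buildTaskId;
        final long indexKey;

        public KeyedRow(int buildTaskId, long indexKey) {
            this.buildTaskId = buildTaskId;
            this.indexKey = indexKey;
        }

        public int buildTaskId() {
            return buildTaskId;
        }

        public long indexKey() {
            return indexKey;
        }
    }

    /** Compares by task id first, then by the index key within one task. */
    public static final Comparator<KeyedRow> TASK_THEN_KEY =
            Comparator.comparingInt(KeyedRow::buildTaskId)
                    .thenComparingLong(KeyedRow::indexKey);

    /** Returns a sorted copy; rows of the same task end up adjacent and ordered. */
    public static List<KeyedRow> sortForWriters(List<KeyedRow> rows) {
        List<KeyedRow> sorted = new ArrayList<>(rows);
        sorted.sort(TASK_THEN_KEY);
        return sorted;
    }
}
```

Because the task id is the leading key component, a single sort pass is enough to group rows by range, which is what allows one shared chain to stand in for one topology per range.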

Why

When row ranges are highly fragmented, the old implementation creates a separate Flink topology for each range. As a result, the create-index procedure can spend a long time constructing the JobGraph and can produce an oversized topology.

Validation

  • mvn -pl paimon-flink/paimon-flink-common -DfailIfNoTests=false -Dtest=BTreeIndexTopoBuilderTest test
  • mvn -pl paimon-flink/paimon-flink-common -Pfast-build -DfailIfNoTests=false -Dtest=BTreeGlobalIndexITCase#testBTreeIndexWithManyPartitions test
  • mvn -pl paimon-flink/paimon-flink-common -Pfast-build -DfailIfNoTests=false -Dtest=BTreeGlobalIndexITCase#testBTreeIndexWithSingleRangeAndParallelWriters test

@leaves12138 leaves12138 changed the title from "[codex] Optimize Flink BTree index topology" to "[core] Optimize Flink BTree index topology" May 18, 2026
@leaves12138 leaves12138 marked this pull request as ready for review May 20, 2026 14:06
Contributor

@JingsongLi JingsongLi left a comment

Review: [core] Optimize Flink BTree index topology

Nice optimization. Replacing N separate Flink topologies (one per row range) with a single unified topology keyed by a synthetic buildTaskId sort prefix is a clean approach to reducing JobGraph construction overhead.

Correctness

The overall design is sound:

  • The buildTaskId field is prepended as the primary sort key, so after range-shuffle + local-sort, data within each writer subtask is guaranteed monotonically ordered by (taskId, indexColumn). Task transitions are one-directional within each subtask.
  • flushCurrentWriter() correctly handles both task-boundary flush and within-task overflow flush.
  • The ReadDataOperator output type matches sortReadType (with the taskId column), and the taskId column survives the sort.
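
The two flush cases mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's writer: the class `BoundaryFlushSketch`, the `maxRecordsPerFile` threshold, and the string summaries are all hypothetical, and the sketch only assumes that input arrives sorted by task id.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryFlushSketch {
    private final int maxRecordsPerFile;          // hypothetical overflow threshold
    private final List<String> flushed = new ArrayList<>();
    private boolean open = false;
    private int currentTaskId;
    private int currentCount;

    public BoundaryFlushSketch(int maxRecordsPerFile) {
        this.maxRecordsPerFile = maxRecordsPerFile;
    }

    /** Accepts one record; input must be non-decreasing in taskId. */
    public void write(int taskId, long indexKey) {
        if (open && taskId < currentTaskId) {
            throw new IllegalStateException("input is not sorted by task id");
        }
        // Flush on a task boundary or when the current file is full.
        if (open && (taskId != currentTaskId || currentCount >= maxRecordsPerFile)) {
            flushCurrentWriter();
        }
        if (!open) {
            open = true;
            currentTaskId = taskId;
            currentCount = 0;
        }
        currentCount++;
    }

    /** Closes the current file, if any, and records a summary of it. */
    public void flushCurrentWriter() {
        if (open) {
            flushed.add("task=" + currentTaskId + ",records=" + currentCount);
            open = false;
        }
    }

    public List<String> flushed() {
        return flushed;
    }
}
```

Since task ids never go backwards within a subtask, the writer only needs to remember the current task; it never has to reopen a file for an earlier one.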

Suggestions

  1. BUILD_TASK_ID_FIELD_ID = -1 -- The choice of a negative field ID avoids collision with real schema field IDs. A short comment documenting this invariant would help future readers.

  2. buildTasksById HashMap rebuilt in every parallel writer subtask -- Each subtask independently reconstructs the full map. For a small number of tasks the cost is negligible, but the map could be restricted to only the relevant tasks in the future.

  3. Parallelism calculation -- Integer division (totalRecords / recordsPerRange) truncates the remainder, so 1500 records with recordsPerRange=1000 yields parallelism=1. This matches the old behavior but is worth noting.

  4. BTreeSplitTask.split field relies on Java serialization (Serializable) -- This is slightly tighter coupling than the previous Flink TypeInformation-based serializer.
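
For suggestion 3, the arithmetic can be checked with a small sketch. Both methods below are hypothetical illustrations, not the PR's calculateParallelism; in particular the lower bound of 1 is an assumption of this example, and any upper bound the real code applies is omitted.

```java
public class ParallelismSketch {

    /** Truncating division: a partial range does not add a writer. */
    public static int floorParallelism(long totalRecords, long recordsPerRange) {
        // Lower bound of 1 is an assumption made for this sketch.
        return (int) Math.max(1L, totalRecords / recordsPerRange);
    }

    /** Ceiling division: a partial range gets its own writer. */
    public static int ceilParallelism(long totalRecords, long recordsPerRange) {
        return (int) Math.max(1L, (totalRecords + recordsPerRange - 1) / recordsPerRange);
    }
}
```

With totalRecords=1500 and recordsPerRange=1000, the truncating form gives 1 and the ceiling form gives 2, so under truncation a single writer would handle all 1500 records.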

Tests

Good coverage: unit tests for calculateParallelism and IT tests for end-to-end flow. Overall a well-structured change. LGTM with minor suggestions.

@JingsongLi
Contributor

+1

@JingsongLi JingsongLi merged commit f840232 into apache:master May 23, 2026
12 checks passed

2 participants