Summary:
This issue aims to start a discussion around improving extensibility in datafusion-distributed, especially for custom plan annotations and network boundaries. I would appreciate insights from the DataFusion community on potential design directions and best practices.
Challenge and Motivation
I believe there is significant value in expanding the extensibility of datafusion-distributed (DFD). The project’s core strengths—plan annotation, insertion of network boundaries, and distribution of sub‑plans to workers—make it a natural place for more flexible customization.
My colleagues and I have been working toward implementing custom network boundaries and plan annotations in a fork of DFD. The use case involves inserting multiple ExecutionPlan nodes instead of relying solely on NetworkShuffleExec, NetworkCoalesceExec, or NetworkBroadcastExec. In practice, this requires a mechanism to introduce custom plan annotations and network boundaries beyond what DFD currently supports.
An initial attempt at introducing this extensibility can be found in this draft PR by @kurtvolmar: kurtvolmar#1.
However, given that DataFusion itself already provides many extension points, it may not be ideal for DFD to introduce a separate, parallel extensibility framework. Instead, it seems likely that a DataFusion‑native mechanism could be leveraged to support more flexible behavior within DFD.
Proposal and Call for Collaboration
The existing DistributedPhysicalOptimizerRule encapsulates a substantial amount of logic. One possible direction would be to decompose this rule into smaller, more focused components—such as a PlanAnnotationRule and NetworkBoundaryRule—and expose configuration or hooks that allow users to implement custom logic where needed.
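To make the decomposition idea concrete, here is a minimal, std-only sketch of what two separate pluggable steps might look like. Everything in it is hypothetical: `Plan`, `Annotation`, `PlanAnnotator`, `BoundaryInserter`, and `distribute` are illustrative names, not existing DataFusion or DFD APIs, and the toy `Plan` enum merely stands in for `Arc<dyn ExecutionPlan>`. A real design would presumably build on DataFusion's own physical optimizer rule mechanism instead.

```rust
// Hypothetical sketch: none of these types exist in DataFusion or DFD.
// A toy plan tree stands in for Arc<dyn ExecutionPlan>.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Aggregate(Box<Plan>),
    // A network boundary wrapping its input; `kind` names the boundary type.
    Boundary { kind: String, input: Box<Plan> },
}

// What the annotation step decided about a node (assumed shape).
#[derive(Debug, Clone, PartialEq)]
enum Annotation {
    None,
    NeedsBoundary(String),
}

// First pluggable step: inspect a node and decide what it needs.
trait PlanAnnotator {
    fn annotate(&self, node: &Plan) -> Annotation;
}

// Second pluggable step: turn an annotation into concrete plan nodes.
trait BoundaryInserter {
    fn insert(&self, node: Plan, annotation: &Annotation) -> Plan;
}

// Walks the tree bottom-up and applies both steps at every node.
fn distribute(plan: Plan, a: &dyn PlanAnnotator, b: &dyn BoundaryInserter) -> Plan {
    let rebuilt = match plan {
        Plan::Aggregate(input) => Plan::Aggregate(Box::new(distribute(*input, a, b))),
        Plan::Boundary { kind, input } => Plan::Boundary {
            kind,
            input: Box::new(distribute(*input, a, b)),
        },
        leaf => leaf,
    };
    let annotation = a.annotate(&rebuilt);
    b.insert(rebuilt, &annotation)
}

// Example user-provided annotator: every scan needs a custom boundary above it.
struct ScanAnnotator;
impl PlanAnnotator for ScanAnnotator {
    fn annotate(&self, node: &Plan) -> Annotation {
        match node {
            Plan::Scan(_) => Annotation::NeedsBoundary("custom_shuffle".to_string()),
            _ => Annotation::None,
        }
    }
}

// Example user-provided inserter: wrap the annotated node in a boundary node.
struct WrapInserter;
impl BoundaryInserter for WrapInserter {
    fn insert(&self, node: Plan, annotation: &Annotation) -> Plan {
        match annotation {
            Annotation::NeedsBoundary(kind) => Plan::Boundary {
                kind: kind.clone(),
                input: Box::new(node),
            },
            Annotation::None => node,
        }
    }
}
```

With this split, a user who needs a different boundary (or several nodes instead of one) would only swap the `BoundaryInserter` implementation, leaving the annotation logic untouched. Whether the two steps should be separate optimizer rules or hooks inside a single rule is exactly the open question here.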
Community input would be highly valuable, particularly around:
- Whether splitting DistributedPhysicalOptimizerRule into smaller, pluggable rules aligns with the project’s direction.
- Alternative approaches in DataFusion that could enable the desired extensibility without modifying DFD directly.
- Prior art or patterns in the DataFusion ecosystem that could help inform a clean design.
Feedback, suggestions, or discussion from maintainers and contributors would be greatly appreciated. The goal is to collaborate on a design that increases flexibility without adding unnecessary complexity to DFD.
cc: @gabotechs