fix(instrumented_exec): close spans when streams finish, not on plan drop #54

antonmry wants to merge 7 commits into datafusion-contrib:main from …
Conversation
…drop

- Replace `OnceLock<Arc<Recorder>>` with `Mutex<Option<Weak<Recorder>>>` so recorders (and their spans) are kept alive only by active streams.
- Have `NodeRecorder` own the span and expose it via `span()`.
- Add regression tests covering span lifetime vs. plan and stream drops.
… happen while parent execution spans are active
@geoffreyclaude I was able to reproduce the issue, but I'm not so sure about the fix, in particular the required changes to the snapshots. I would appreciate your feedback before I continue with it.
@antonmry I think you're on the right track, but I believe there's still one tricky edge case here. The current state is still tied to the plan node itself, and seems to assume each partition is executed only once. I'm not sure what the proper solution is here, though!
- Replace the per-partition completion vector with an active-stream counter so partial or repeated partition executions close the span correctly.
- Release the recorder slot when `inner.execute` fails with no active streams.
- Add tests for concurrent, partial, and repeated partition execution.
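(A minimal sketch of the active-stream counter idea, using hypothetical stand-in types rather than the crate's actual ones: each execution stream holds a guard, and the last guard to drop takes the span out of the shared state, closing it.)

```rust
use std::sync::{Arc, Mutex};
use tracing::Span;

// Hypothetical stand-in for the shared per-node state.
struct SharedState {
    active_streams: usize,
    span: Option<Span>,
}

// Each execution stream holds one guard; dropping the last one
// removes the Span from the shared state, which ends it.
struct StreamGuard(Arc<Mutex<SharedState>>);

impl Drop for StreamGuard {
    fn drop(&mut self) {
        let mut state = self.0.lock().unwrap();
        state.active_streams -= 1;
        if state.active_streams == 0 {
            state.span.take(); // last stream gone: span is dropped here
        }
    }
}
```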
@geoffreyclaude Excellent point, thank you. I've committed d554065 to address it. Would that work? The numbers are still good.
- Add a process-wide async mutex so snapshot test cases run one at a time.
- Clear the shared log buffer at the start of each case to avoid cross-test bleed.
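(A sketch of the serialization idea, assuming tokio and hypothetical helper names: one process-wide async lock that every snapshot case acquires before touching the shared log buffer.)

```rust
use std::sync::OnceLock;
use tokio::sync::{Mutex, MutexGuard};

// Hypothetical process-wide lock; names are illustrative only.
static SNAPSHOT_LOCK: OnceLock<Mutex<()>> = OnceLock::new();

async fn serialize_snapshot_case() -> MutexGuard<'static, ()> {
    let guard = SNAPSHOT_LOCK.get_or_init(|| Mutex::new(())).lock().await;
    // With the lock held, the shared log buffer can be cleared here
    // so earlier tests cannot bleed into this case's snapshot.
    guard
}
```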
geoffreyclaude left a comment
Thanks a lot for continuing to push this! The active-stream change fixes the weird cases of `execute` being called multiple times.
But I think there is still one minor issue left: overlapping/concurrent executions will share the same recorders, and so the same span, when we probably want a distinct span per execution call.
Maybe keying the recorders by `TaskContext` (to avoid mixing up two concurrent plan executions on different contexts) and by "has this partition been seen" (to avoid mixing up concurrent calls in the same context and on the same partition) would work?
To be clearer, for both these cases we want two different spans:

```rust
// different contexts
let s1 = plan.execute(0, ctx_a.clone())?;
let s2 = plan.execute(1, ctx_b.clone())?;
```

```rust
// duplicate executions
let s1 = plan.execute(0, ctx.clone())?;
let s2 = plan.execute(0, ctx.clone())?;
```

Sorry for being so strict about this! Your initial fix of the long drops is great, but it uncovered a whole class of preexisting concurrency issues.
- Track recorder groups as a vec keyed by task context and reserved partitions, so independent or duplicate executions get fresh spans.
- Reserve the stream before `inner.execute` and release it on failure to avoid leaking active recorder groups.
- Add tests covering distinct task contexts, execute errors, and overlapping duplicate-partition streams.
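(A rough sketch of the data shape this implies, with hypothetical names; the real code also manages the metrics and preview recorders. A group is reused only for the same task context and a not-yet-seen partition; anything else gets a fresh group and therefore a fresh span.)

```rust
use std::collections::HashSet;
use std::sync::{Mutex, Weak};

struct NodeRecorder { /* owns the span */ }

// Hypothetical recorder group: one per logical execution call site.
struct RecorderGroup {
    ctx_id: usize,              // identity of the TaskContext (Arc pointer)
    partitions: HashSet<usize>, // partitions already started in this group
    recorder: Weak<NodeRecorder>,
}

struct InstrumentedExec {
    groups: Mutex<Vec<RecorderGroup>>,
}
```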
Hey @geoffreyclaude, I appreciate the careful review. I'm still getting familiar with this part of the codebase, so I'm very happy to keep iterating until we have a solid fix. Please keep the comments coming. I added 13a90c0 following your suggested direction. Let me know what you think.
Sort `RecursiveQueryExec` close events in the recursive trace snapshots so the assertion no longer depends on the non-deterministic order in which child execution streams finish across runtimes.
Which issue does this PR close?
Rationale for this change
When `datafusion-tracing` is used with an OpenTelemetry subscriber backed by `SimpleSpanProcessor`, dropping an `Arc<dyn ExecutionPlan>` could block the calling thread for several seconds.

Each `InstrumentedExec` node held a strong `Arc` to its recorders and a clone of its tracing `Span` in `OnceLock` fields. Those fields lived until the plan itself was dropped. With `SimpleSpanProcessor`, every span close calls `futures_executor::block_on(exporter.export(...))` synchronously, so a query plan with 6 nodes and a 50 ms/span exporter blocked `drop()` for ~328 ms. At real-world OTLP latencies (200 ms to Jaeger under load) and larger plans (40 nodes), this matches the 8-second drop reported in the issue.
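(For intuition, a minimal sketch of the old ownership shape, with simplified stand-in types rather than the crate's actual ones: the plan node keeps the recorder, and its span, alive until the plan drops, so the synchronous span export happens inside `drop()`.)

```rust
use std::sync::{Arc, OnceLock};
use tracing::Span;

// Stand-in for the recorder; its Span field is dropped along with it,
// which is when SimpleSpanProcessor synchronously exports the span.
struct NodeRecorder {
    span: Span,
}

// Old shape: strong ownership from the plan node itself, so the
// recorder (and span) lived exactly as long as the plan.
struct InstrumentedExec {
    recorder: OnceLock<Arc<NodeRecorder>>,
    span: OnceLock<Span>, // plus a separate clone of the span
}
```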
What changes are included in this PR?
- `InstrumentedExec` now holds `Mutex<Option<Weak<*Recorder>>>` instead of `OnceLock<Arc<*Recorder>>`, and no longer stores a `Span` clone at all.
- `get_or_create_node_recorder()` creates the span (inside `NodeRecorder`) on the first `execute()` call and stores only a `Weak`; the `Arc` is owned by the execution stream. `MetricsRecorder` and `PreviewRecorder` follow the same pattern.
- When the last stream finishes, the recorders drop, span fields are recorded, and the span closes.
- Dropping the plan releases only `Weak` pointers: no span I/O, no blocking.
- A new `span()` accessor was added to `NodeRecorder` so `execute()` can get the span from the recorder instead of a separate `OnceLock`.
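(And a matching sketch of the new shape, again with hypothetical simplified types: the stream owns the `Arc`, the plan keeps only a `Weak`, and a later `execute()` call reuses the recorder only while some stream still holds it.)

```rust
use std::sync::{Arc, Mutex, Weak};

struct NodeRecorder { /* owns the Span; exposes span() */ }

struct InstrumentedExec {
    // New shape: only a Weak handle; the strong Arc moves into
    // the execution stream returned by execute().
    recorder: Mutex<Option<Weak<NodeRecorder>>>,
}

impl InstrumentedExec {
    fn get_or_create_node_recorder(&self) -> Arc<NodeRecorder> {
        let mut slot = self.recorder.lock().unwrap();
        // Reuse the live recorder while any stream still holds it...
        if let Some(live) = slot.as_ref().and_then(Weak::upgrade) {
            return live;
        }
        // ...otherwise create a fresh recorder (and span) and keep
        // only a Weak, so dropping the plan never closes the span.
        let strong = Arc::new(NodeRecorder {});
        *slot = Some(Arc::downgrade(&strong));
        strong
    }
}
```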
Are these changes tested?
A self-contained reproducer is available at:
https://gist.github.com/antonmry/7f2355b83cd1cf2c37ea92bea7cd245b
It simulates a slow OTLP backend (50 ms/span) and measures `drop()` time for both `SimpleSpanProcessor` and `BatchSpanProcessor`, before the fix (published `datafusion-tracing = "53.0.0"`) and after the fix (this PR).

`SimpleSpanProcessor` drop time falls from 328 ms → 161 µs (~2000×). Both processors are now equivalent from the caller's perspective: `drop()` is instant in both cases because all span work completes during `collect()`.

The existing unit-test suite passes without changes.
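(The gist is the authoritative reproducer; as a rough self-contained illustration of what it measures, here is a stand-in whose slow `Drop` plays the role of a synchronous 50 ms/span export.)

```rust
use std::time::{Duration, Instant};

// Stand-in for a plan whose Drop used to export spans inline.
struct SlowDropPlan;

impl Drop for SlowDropPlan {
    fn drop(&mut self) {
        // Simulate one synchronous 50 ms span export.
        std::thread::sleep(Duration::from_millis(50));
    }
}

fn main() {
    let plan = SlowDropPlan;
    let start = Instant::now();
    drop(plan);
    // Before the fix this scaled with the number of plan nodes;
    // after it, the crate's drop() does no span I/O at all.
    println!("drop() took {:?}", start.elapsed());
}
```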
Are there any user-facing changes?
Spans now close when the last execution stream is consumed rather than when the
plan is dropped. This is a behavioral improvement: span end times more
accurately reflect when execution actually finished. There are no API changes.