[fix](be) Refine revocable memory accounting for spill #62581
mrhhsg wants to merge 1 commit into apache:master
run beut
/review
Findings
be/src/runtime/workload_management/query_task_controller.cpp:154-163
The new `tasks.empty()` branch does not actually recover the paused query. `WorkloadGroupMgr::handle_single_query_()` only calls `revoke_memory()` after an earlier `get_revocable_tasks()` check succeeded, and then unconditionally removes the query from `_paused_queries_list` once `revoke_memory()` returns `OK` (`be/src/runtime/workload_group/workload_group_manager.cpp:735-742`). If the task list becomes empty between those two scans, this branch only logs and returns success; nobody runs the existing no-revocable-task fallback and nobody calls `QueryContext::set_memory_sufficient(true)` (`be/src/runtime/query_context.cpp:275-285`). The query stays blocked on `_memory_sufficient_dependency` (`be/src/exec/pipeline/pipeline_task.cpp:340`) and can no longer make forward progress. Please propagate a distinct result here so the caller can resume/cancel via its existing fallback path instead of treating this as a successful spill.
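The requested control flow can be sketched roughly as follows. This is a simplified, self-contained model and not the Doris code: `RevokeResult`, `FakeQuery`, and the free functions below are hypothetical stand-ins, and the real implementation returns a Doris `Status` and goes through `QueryContext`. The sketch also assumes the caller's fallback is a plain resume; the actual fallback may cancel instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type; the real code would encode this in a Status.
enum class RevokeResult { kSpillStarted, kNoRevocableTasks };

// Hypothetical stand-in for the per-query state the manager tracks.
struct FakeQuery {
    std::vector<std::size_t> revocable_tasks; // revocable bytes per task
    bool memory_sufficient = false;           // false means the query is paused
    bool in_paused_list = true;
};

// Stand-in for QueryTaskController::revoke_memory().
RevokeResult revoke_memory(FakeQuery& q) {
    if (q.revocable_tasks.empty()) {
        // The task list drained after the caller's earlier check:
        // report it instead of returning plain success.
        return RevokeResult::kNoRevocableTasks;
    }
    q.revocable_tasks.clear(); // pretend a spill was scheduled for every task
    return RevokeResult::kSpillStarted;
}

// Stand-in for the caller; returns a label describing the action taken.
std::string handle_single_query(FakeQuery& q) {
    switch (revoke_memory(q)) {
    case RevokeResult::kSpillStarted:
        q.in_paused_list = false; // spill completion is expected to resume the query
        return "spilling";
    case RevokeResult::kNoRevocableTasks:
        q.memory_sufficient = true; // assumed fallback: resume the query
        q.in_paused_list = false;
        return "resumed";
    }
    return "unreachable";
}
```

The point of the distinct result is that the query leaves the paused list only together with an action that unblocks it, so the empty-task race can no longer strand it.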
Critical Checkpoints
- Goal of this PR: Partially met. The `PipelineTask` accounting change now matches the actual spillability threshold, but the new empty-task handling in `QueryTaskController::revoke_memory()` still does not safely recover the query when that edge case happens.
- Scope and focus: Yes. The change is small and localized to BE spill accounting / query revocation paths.
- Concurrency: Applicable, and this is where the blocking issue is. There is still a TOCTOU window between `get_revocable_tasks()` and `revoke_memory()`; the new empty-task branch is not concurrency-safe because it returns success without resuming or cancelling the blocked query.
- Lifecycle / static initialization: No new lifecycle or static-init problems found. Keeping `fragments` alive while raw `PipelineTask*` pointers are used remains appropriate.
- Configuration changes: None.
- Compatibility / rolling upgrade: None.
- Parallel code paths: The `PipelineTask::_should_trigger_revoking()` change is aligned with `do_revoke_memory()` and fragment-level revocable-task filtering; I did not find another spill-accounting path that obviously needed the same update.
- Special conditional checks: The new `MIN_SPILL_WRITE_BATCH_MEM` filtering in `PipelineTask` is reasonable because actual revocation already uses that threshold.
- Test coverage: Incomplete for this PR's new edge case. Existing `PipelineTaskTest` coverage exercises revocation thresholds, but there is no test covering the new empty-task path / paused-query recovery race in `QueryTaskController`.
- Test result updates: None.
- Observability: The added INFO log helps explain why the task list was empty, but observability alone is not enough because the control flow still leaves the query blocked.
- Transaction / persistence / FE-BE variable passing / storage format: Not applicable.
- Performance: The accounting refinement should reduce futile spill triggers for sub-threshold buffers; no blocking performance regression found beyond the recovery issue above.
- Other issues: None beyond the blocking recovery bug above.
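As a rough illustration of the accounting refinement discussed in the checkpoints above, the following sketch counts only buffers that reach the spill threshold. It is not the Doris implementation: `kMinSpillWriteBatchMem` is a placeholder value standing in for `MIN_SPILL_WRITE_BATCH_MEM`, `spillable_revocable_bytes` is a hypothetical helper, and the `>=` comparison is an assumption about how the threshold is applied.

```cpp
#include <cstddef>
#include <vector>

// Placeholder threshold; NOT the real value of MIN_SPILL_WRITE_BATCH_MEM.
constexpr std::size_t kMinSpillWriteBatchMem = 32 * 1024;

// Hypothetical helper: sum only the buffers large enough to be spilled,
// so the estimate matches what a later revocation could actually free.
std::size_t spillable_revocable_bytes(const std::vector<std::size_t>& operator_bytes) {
    std::size_t total = 0;
    for (std::size_t bytes : operator_bytes) {
        if (bytes >= kMinSpillWriteBatchMem) { // assumed inclusive comparison
            total += bytes;
        }
    }
    return total;
}
```

With this kind of filter, a task holding only sub-threshold buffers reports zero revocable memory and therefore does not trigger a spill that would free nothing.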
Review note: tests were not run in this review environment.
/review
OpenCode automated review failed and did not complete. Error: Review step was failure (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved.
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Exclude small non-spillable revocable buffers from pipeline task revocable memory accounting and handle queries without revocable tasks when triggering memory revocation.
Release note
None
Check List (For Author)
Test: No need to test (commit existing staged changes only)
Behavior changed: Yes (revocable memory estimation and empty revocation handling are adjusted)
Does this need documentation: No