
[fix](be) Refine revocable memory accounting for spill#62581

Open
mrhhsg wants to merge 1 commit into apache:master from mrhhsg:fix_spill


@mrhhsg
Member

@mrhhsg mrhhsg commented Apr 17, 2026

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Exclude small, non-spillable revocable buffers (those below the spill write batch threshold) from the pipeline task's revocable memory accounting, and handle queries that have no revocable tasks when memory revocation is triggered.
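To make the accounting change concrete, here is a minimal C++ sketch of threshold-filtered revocable memory accounting. It is not the code from this PR: the function `spillable_revocable_bytes` and the constant `kMinSpillWriteBatchMem` are hypothetical, the 512 KiB value is a placeholder rather than the real `MIN_SPILL_WRITE_BATCH_MEM`, and treating a buffer exactly at the threshold as spillable (`>=`) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for MIN_SPILL_WRITE_BATCH_MEM. The value is a
// placeholder for illustration, not the one used by the Doris BE.
constexpr std::size_t kMinSpillWriteBatchMem = 512 * 1024;

// Sums only the revocable buffers that are large enough to actually be spilled.
// Buffers below the threshold would be skipped by the revoke path anyway, so
// counting them would make the task look more revocable than it really is.
std::size_t spillable_revocable_bytes(const std::vector<std::size_t>& buffer_sizes,
                                      std::size_t min_batch = kMinSpillWriteBatchMem) {
    std::size_t total = 0;
    for (std::size_t bytes : buffer_sizes) {
        if (bytes >= min_batch) {  // assumption: a buffer at the threshold counts
            total += bytes;
        }
    }
    return total;
}
```

Under this model, a task that holds only small buffers reports zero revocable bytes, so it would no longer be chosen as a spill candidate that cannot actually free anything.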

Release note

None

Check List (For Author)

  • Test: No need to test (commit existing staged changes only)

  • Behavior changed: Yes (revocable memory estimation and the handling of revocation when no revocable tasks exist are adjusted)

  • Does this need documentation: No

Check List (For Reviewer Who Merges This PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Apr 17, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mrhhsg
Member Author

mrhhsg commented Apr 17, 2026

run beut

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 53.57% (15/28) 🎉


Category Coverage
Function Coverage 53.27% (20297/38100)
Line Coverage 36.76% (191141/519960)
Region Coverage 33.09% (148675/449361)
Branch Coverage 34.14% (64910/190112)

@mrhhsg
Member Author

mrhhsg commented Apr 17, 2026

/review

Contributor

@github-actions github-actions bot left a comment

Findings

  1. be/src/runtime/workload_management/query_task_controller.cpp:154-163
    The new tasks.empty() branch does not actually recover the paused query. WorkloadGroupMgr::handle_single_query_() only calls revoke_memory() after an earlier get_revocable_tasks() check succeeded, and then unconditionally removes the query from _paused_queries_list once revoke_memory() returns OK (be/src/runtime/workload_group/workload_group_manager.cpp:735-742). If the task list becomes empty between those two scans, this branch only logs and returns success; nobody runs the existing no-revocable-task fallback and nobody calls QueryContext::set_memory_sufficient(true) (be/src/runtime/query_context.cpp:275-285). The query stays blocked on _memory_sufficient_dependency (be/src/exec/pipeline/pipeline_task.cpp:340) and can no longer make forward progress. Please propagate a distinct result here so the caller can resume/cancel via its existing fallback path instead of treating this as a successful spill.
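The fix requested in this finding can be sketched as follows. This is a hypothetical, simplified model, not Doris code: `RevokeOutcome`, `PausedQuery`, `revoke_memory` and `handle_paused_query` are illustrative names, `memory_sufficient` stands in for `QueryContext::set_memory_sufficient(true)`, and the resume-or-cancel decision is reduced to a boolean. The only point it makes is that "no revocable tasks left" reaches the caller as a distinct result, so the caller can take its existing fallback instead of treating the call as a successful spill.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: "nothing to revoke" is reported separately from
// "spill started" instead of both collapsing into a plain OK.
enum class RevokeOutcome { kSpillStarted, kNoRevocableTasks };

struct Task {
    std::size_t revocable_bytes = 0;
};

// Hypothetical, simplified paused-query state.
struct PausedQuery {
    std::vector<Task*> revocable_tasks;  // re-scanned when revocation runs
    bool memory_sufficient = false;      // stands in for QueryContext::set_memory_sufficient(true)
    bool cancelled = false;
    bool in_paused_list = true;
};

RevokeOutcome revoke_memory(PausedQuery& q) {
    if (q.revocable_tasks.empty()) {
        // The tasks may have finished after the caller's earlier check.
        // Report that instead of claiming a spill was started.
        return RevokeOutcome::kNoRevocableTasks;
    }
    // Spill work for each task would be submitted here; in this model,
    // spill completion (not shown) is what eventually resumes the query.
    return RevokeOutcome::kSpillStarted;
}

// Hypothetical caller, loosely modelled on WorkloadGroupMgr::handle_single_query_().
void handle_paused_query(PausedQuery& q, bool can_resume_without_spill) {
    switch (revoke_memory(q)) {
        case RevokeOutcome::kSpillStarted:
            break;
        case RevokeOutcome::kNoRevocableTasks:
            // Take the no-revocable-task fallback: resume or cancel the query.
            if (can_resume_without_spill) {
                q.memory_sufficient = true;
            } else {
                q.cancelled = true;
            }
            break;
    }
    // As in the reviewed code, the query leaves the paused list either way.
    q.in_paused_list = false;
}
```

In the real BE the distinct result could be carried by the existing status type or by an out-parameter; which form fits best is a design choice this sketch leaves open.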

Critical Checkpoints

  • Goal of this PR: Partially met. The PipelineTask accounting change now matches the actual spillability threshold, but the new empty-task handling in QueryTaskController::revoke_memory() still does not safely recover the query when that edge case happens.
  • Scope and focus: Yes. The change is small and localized to BE spill accounting / query revocation paths.
  • Concurrency: Applicable, and this is where the blocking issue is. There is still a TOCTOU window between get_revocable_tasks() and revoke_memory(); the new empty-task branch is not concurrency-safe because it returns success without resuming or cancelling the blocked query.
  • Lifecycle / static initialization: No new lifecycle or static-init problems found. Keeping fragments alive while raw PipelineTask* pointers are used remains appropriate.
  • Configuration changes: None.
  • Compatibility / rolling upgrade: None.
  • Parallel code paths: The PipelineTask::_should_trigger_revoking() change is aligned with do_revoke_memory() and fragment-level revocable-task filtering; I did not find another spill-accounting path that obviously needed the same update.
  • Special conditional checks: The new MIN_SPILL_WRITE_BATCH_MEM filtering in PipelineTask is reasonable because actual revocation already uses that threshold.
  • Test coverage: Incomplete for this PR's new edge case. Existing PipelineTaskTest coverage exercises revocation thresholds, but there is no test covering the new empty-task path / paused-query recovery race in QueryTaskController.
  • Test result updates: None.
  • Observability: The added INFO log helps explain why the task list was empty, but observability alone is not enough because the control flow still leaves the query blocked.
  • Transaction / persistence / FE-BE variable passing / storage format: Not applicable.
  • Performance: The accounting refinement should reduce futile spill triggers for sub-threshold buffers; no blocking performance regression found beyond the recovery issue above.
  • Other issues: None beyond the blocking recovery bug above.
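For the missing test coverage noted in the checkpoints, one way to make the check-then-revoke race reproducible is to inject a hook between the two scans. The sketch below is a hypothetical, self-contained model (every name is illustrative and none is a Doris API). It contrasts the reviewed behavior, where an empty task list still yields OK, with a variant that reports the empty case so the caller can resume the query.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical, heavily simplified model of a paused query.
struct QueryModel {
    std::vector<int> revocable_task_ids;
    bool in_paused_list = true;
    bool memory_sufficient = false;  // what would unblock the pipeline tasks
};

enum class Revoke { kOk, kNothingToRevoke };

// Shape described in the review: an empty task list is logged and returns OK.
Revoke revoke_reviewed(QueryModel&) { return Revoke::kOk; }

// Variant that reports the empty case distinctly.
Revoke revoke_fixed(QueryModel& q) {
    return q.revocable_task_ids.empty() ? Revoke::kNothingToRevoke : Revoke::kOk;
}

// Caller: first scan, race window, then revoke. `between_scans` lets a test
// deterministically simulate tasks finishing inside the TOCTOU window.
void handle(QueryModel& q, const std::function<Revoke(QueryModel&)>& revoke,
            const std::function<void(QueryModel&)>& between_scans) {
    if (q.revocable_task_ids.empty()) {
        q.memory_sufficient = true;  // existing no-revocable-task fallback, modelled as a resume
        q.in_paused_list = false;
        return;
    }
    between_scans(q);
    if (revoke(q) == Revoke::kNothingToRevoke) {
        q.memory_sufficient = true;  // fall back instead of assuming a spill is running
    }
    q.in_paused_list = false;  // removed from the paused list in both cases
}

// True if the query was dropped from the paused list, no spill is pending,
// and nothing will ever mark its memory as sufficient again.
bool left_blocked(const QueryModel& q) {
    return !q.in_paused_list && !q.memory_sufficient && q.revocable_task_ids.empty();
}
```

A real test would drive `QueryTaskController` and `WorkloadGroupMgr` instead, but the idea carries over: force the task list to become empty between `get_revocable_tasks()` and `revoke_memory()`, then assert that the query is either resumed or cancelled.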

Review note: tests were not run in this review environment.

@mrhhsg
Member Author

mrhhsg commented Apr 17, 2026

/review

@github-actions
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24561063428

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@mrhhsg
Copy link
Copy Markdown
Member Author

mrhhsg commented Apr 17, 2026

/review

@github-actions
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24571734974

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.


Labels

None yet

Projects

None yet

3 participants