Skip to content

fix(ui): return 409 for overlapping chat turns#1304

Open
fallintoplace wants to merge 1 commit into
amd:mainfrom
fallintoplace:fix/session-lock-conflict-1303
Open

fix(ui): return 409 for overlapping chat turns#1304
fallintoplace wants to merge 1 commit into
amd:mainfrom
fallintoplace:fix/session-lock-conflict-1303

Conversation

@fallintoplace
Copy link
Copy Markdown

Summary

  • return 409 when a second /api/chat/send request arrives for a session that already has an active turn
  • remove the unsafe force-release path for held session locks
  • update the chat concurrency unit test to assert the conflict behavior we already expect in stress coverage

Why

The old code would wait 5 seconds, call release() on the per-session asyncio.Lock, and then continue. That is not safe because asyncio.Lock does not track ownership. A slow but healthy request could still be holding the lock when a second request force-unlocked it, which opened the door to overlapping turns mutating the same session.

This change keeps the contract simple: one active turn per session, and the second request gets a 409 instead of barging through.

Closes #1303.

Testing

  • uv run pytest tests/unit/chat/ui/test_chat_concurrency.py -q
  • uv run python -m compileall src/gaia/ui/routers/chat.py tests/unit/chat/ui/test_chat_concurrency.py

@github-actions github-actions Bot added the tests Test changes label May 30, 2026
@fallintoplace fallintoplace marked this pull request as ready for review May 30, 2026 11:17
@github-actions
Copy link
Copy Markdown
Contributor

Correct fix for a real safety violation — removing the force-release path eliminates the risk of two coroutines writing to the same session concurrently, and the 409 response gives the client a clear, actionable signal. The change is small, focused, and the test update matches the new contract precisely.


Issues Found

🟢 Minor — TOCTOU window is safe in asyncio, but worth a short note (chat.py:92-98)

The locked() check and acquire() are not atomic at the language level, but they are safe here because asyncio's cooperative model never yields between them (no await between the two lines). A future reader who is unfamiliar with asyncio's single-threaded scheduler might worry about a race. Consider a one-liner comment to pre-empt the question:

    # No await between the check and acquire — safe under asyncio's cooperative scheduler.
    if session_lock.locked():
        raise HTTPException(
            status_code=409,
            detail="A chat request is already in progress for this session. "
            "Please wait for it to finish.",
        )
    await session_lock.acquire()

🟢 Minor — Frontend 409 handling not in scope but should be tracked

The router now returns 409 where it previously succeeded (or eventually succeeded after 5 s). If the Agent UI frontend doesn't already handle 409 on /api/chat/send, users will see an unhandled error instead of a "please wait" message. This PR is backend-only, so it doesn't need to fix this — but it's worth opening a follow-up or checking src/gaia/apps/webui/ for the error-handling path before this lands in a release.


Strengths

  • Diagnosis is precise. The PR description correctly identifies why the force-release was unsafe (asyncio.Lock has no ownership tracking), and the removal is targeted — nothing else in the concurrency logic is disturbed.
  • Test cleanup is correct. The old test omitted lock.release() because it expected the force-release path to do it. The new test adds the explicit lock.release() before loop.close(), which is exactly right.
  • Comment quality. The inline comment in chat.py names the invariant being enforced ("force-releasing … is unsafe because the lock has no ownership tracking") without exceeding one line — follows CLAUDE.md's comment guideline precisely.

Verdict

Approve. The nit above is optional; the fix is sound and safe to merge as-is.

@itomek itomek self-assigned this Jun 1, 2026
Copy link
Copy Markdown
Collaborator

@itomek itomek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Verified locally; this is a correct, well-scoped fix.

Thanks @fallintoplace — the diagnosis in the body matches what's in the diff, and the new path at chat.py:92-96 is the right shape.

Independent verification (since this is a fork PR and CI only ran metadata checks at the time of local review — full suite has now completed green):

  • 409 on conflict — confirmed at chat.py:92-96, asserted in the renamed TestSessionLockConflict test (409 + "already in progress").
  • Force-release path fully removed — asyncio.wait_for, the bare release(), and the inner except RuntimeError: pass are all gone.
  • Test updated and passing — 9/9 in test_chat_concurrency.py, 672 passing across the broader chat/ui suite. Lint (black + isort) clean on both files.

Two things worth affirming beyond the surface diff:

  • Removing the old release() in the timeout handler closes a real concurrency hazard, not just a style issue. asyncio.Lock doesn't track ownership, so calling release() from a coroutine that never held the lock could hand the lock to a waiter prematurely and corrupt another turn's critical section. Replacing that with a 409 is the correct fix.
  • The deleted except RuntimeError: pass also retires a pre-existing fail-quietly pattern, which moves this code in the fail-loudly direction the project wants.

The two pre-existing except Exception: pass blocks at chat.py:44-49 and chat.py:199 are out of scope here — fine to leave for a separate change.

No blocking findings. Deferring to @kovtcharov-amd for the final merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: /api/chat/send can unlock an active session after 5s and let two turns run at once

2 participants