Skip to content

diag(merge): log 404 cause + pin Job queue_name + max_jobs=1#8

Merged
brad07 merged 1 commit into
mainfrom
diag/merge-job-404
May 17, 2026
Merged

diag(merge): log 404 cause + pin Job queue_name + max_jobs=1#8
brad07 merged 1 commit into
mainfrom
diag/merge-job-404

Conversation

@brad07
Copy link
Copy Markdown

@brad07 brad07 commented May 17, 2026

Summary

  • Log a WARNING at each of the 4 branches in GET /clips/merge_jobs/{merge_job_id} that emit the same opaque 404, so we can tell why a merge job appears "not found"
  • Pin _queue_name=DEFAULT_QUEUE_NAME on Job(...) constructed by JobQueue._job() so Job.status() can detect queued-but-not-started jobs in the supoclip_tasks queue
  • Drop WorkerSettings.max_jobs from 4 -> 1 so one ffmpeg encode gets the full task CPU budget (pairs with BN-side CDK bump to 4 vCPU on prod workers)

Why

BN-side smoke test is hitting an immediate 404 on the first poll, ~1s after a successful enqueue. The handler has four code paths that all emit the same generic message — new log line tells us which fired without leaking detail to the caller.

Separately: BN-side worker tier bump from 2 -> 4 vCPU is pointless if a second concurrent job could land on the same worker and halve the per-job CPU budget. libx264 doesn't scale linearly past ~4-6 threads so packing concurrent encodes onto one worker is a net loss; horizontal scaling is the right lever.

Test plan

  • Deploy + reproduce the smoke-test 404
  • aws logs tail filter "Merge job not found" on the supoclip backend log group — should print one of:
    • info=None merge_job_id=… task_id=…
    • function mismatch … function=…
    • args mismatch … info_args=…
    • status=None after info OK …
  • Confirm worker logs show max_jobs=1 on startup

🤖 Generated with Claude Code

We're seeing 404 "Merge job ... not found" on the very first poll, ~1s
after a successful enqueue. The handler has four distinct branches that
all emit the same opaque 404 (info=None / function mismatch / args
mismatch / post-info status=None), so the log line doesn't tell us which
one fired.

Two changes:

1. Log a server-side warning at each branch so the next 404 says
   *why*. Caller still sees the same generic 404 — we don't want to
   leak the existence (or shape) of other tasks' jobs.

2. Pass `_queue_name=DEFAULT_QUEUE_NAME` through `JobQueue._job()`.
   Job.info() and Job.result() don't depend on queue_name, but
   Job.status() reads the queue ZSET (`zscore(self._queue_name, ...)`)
   to detect the `queued` state — without this, a job still waiting
   in our `supoclip_tasks` queue would report `not_found` because the
   Job handle defaulted to arq's `arq:queue`. Defensive even though
   it's almost certainly not the cause of the current 404 (info() is
   what fires first and that's queue-agnostic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5973f2ce-4369-4753-b4ae-1a3c4aae68c5

📥 Commits

Reviewing files that changed from the base of the PR and between fd06b0d and a918e62.

📒 Files selected for processing (2)
  • backend/src/api/routes/tasks.py
  • backend/src/workers/job_queue.py

📝 Walkthrough

Walkthrough

This PR hardens merge-job resolution by pinning the queue name in Job construction and adding detailed warning logs to the merge-job polling endpoint. The core fix ensures Job.status() can correctly detect queued jobs; the logging adds observability for debugging edge cases like missing info, function/args mismatches, and race-condition status evictions.

Changes

Job Queue Resolution and Logging

Layer / File(s) Summary
Job queue construction with explicit queue name
backend/src/workers/job_queue.py
JobQueue._job() now explicitly constructs Job objects with _queue_name=DEFAULT_QUEUE_NAME, ensuring status() lookups correctly detect queued jobs in the supoclip_tasks queue instead of relying on arq defaults.
Merge-job polling endpoint logging for lookup failures
backend/src/api/routes/tasks.py
get_merge_job() adds branch-specific warning logs for each job lookup failure: when get_job_info() returns None, when job function does not match merge_clips_job, when job args do not match the task ID, and when status resolution returns None after info succeeded. All paths return the same HTTP 404 response.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • Integrity-Labs/supoclip#7: Adds the async merge routes and get_merge_job polling endpoint; this PR further refines the same merge polling code paths and fixes underlying job queue lookup behavior via queue name pinning.

Poem

🐰 A queue by any name would resolve just the same,
But pinned ones remember their place all the same!
We log every edge case, each stumble and fall,
So engineers see exactly what happened at all! 📋✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes both main changes: adding diagnostic logging for the merge-job 404 branches and pinning the Job queue name for correct queue resolution.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch diag/merge-job-404

Comment @coderabbitai help to get the list of available commands and usage tips.

@brad07 brad07 merged commit b4b5458 into main May 17, 2026
8 checks passed
@brad07 brad07 changed the title diag(merge): log which security-check branch fires + pin Job queue_name diag(merge): log 404 cause + pin Job queue_name + max_jobs=1 May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant