Patch @nicnocquee/dataqueue to fix processor stalling under groupConcurrency#786

Merged
kevinhermawan merged 4 commits into main from fix/dataqueue-continuous-pool-patch
Jun 12, 2026
Conversation

@kevinhermawan

Summary

The Hermes data-plane processor was stalling after a partial pipeline run. With concurrency=3 and groupConcurrency=3, the old processor claimed a batch of 3 jobs and then sat idle until all three finished before polling again. One slow job would hold up the two freed slots for the rest of the batch — so pending pipeline jobs went unclaimed for minutes while workers had capacity.

This PR patches @nicnocquee/dataqueue@1.39.0 locally with pnpm patch while the upstream fix (nicnocquee/dataqueue#41) awaits review. The patch replaces the batch-barrier loop with a continuous pool: each job calls pump() from its .finally(), refilling its slot the moment it finishes.

Related issues

Closes #785

Important changes

  • patches/@nicnocquee__dataqueue@1.39.0.patch — rewrites startInBackground in both dist/index.cjs and dist/index.js. The old intervalId/currentBatchPromise state is replaced with pollTimer/claimInProgress/inFlight/inFlightJobs. A new pump() function claims up to min(concurrency - inFlight, batchSize) jobs and each job's .finally() decrements inFlight and calls void pump() again, so freed slots are filled immediately rather than waiting for the whole batch to settle.
  • package.json — adds pnpm.patchedDependencies entry for @nicnocquee/dataqueue@1.39.0.
  • pnpm-lock.yaml — updated to resolve the patched virtual store entry.
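
The refill behavior described above can be sketched roughly as follows. This is an illustrative reconstruction, not the contents of the patch: the names inFlight, claimInProgress, and pump come from the description, while slotsToClaim, createContinuousPool, and the claim/handle callbacks are hypothetical stand-ins for the library's internals. The poll timer is left out.

```typescript
// Sketch only: mirrors the PR description, not the actual patched dist code.
type Job = { id: number };

interface PoolOptions {
  concurrency: number;
  batchSize: number;
  // Hypothetical: returns up to `limit` pending jobs from the queue.
  claim: (limit: number) => Promise<Job[]>;
  // Hypothetical: processes one job.
  handle: (job: Job) => Promise<void>;
}

// Number of jobs a single pump() pass may claim:
// min(concurrency - inFlight, batchSize), never negative.
function slotsToClaim(concurrency: number, inFlight: number, batchSize: number): number {
  return Math.max(0, Math.min(concurrency - inFlight, batchSize));
}

function createContinuousPool(opts: PoolOptions) {
  let inFlight = 0;
  let claimInProgress = false;
  let stopped = false;

  async function pump(): Promise<void> {
    // Only one claim at a time, so concurrent pump() calls cannot over-claim.
    if (stopped || claimInProgress) return;
    const free = slotsToClaim(opts.concurrency, inFlight, opts.batchSize);
    if (free === 0) return;

    claimInProgress = true;
    let jobs: Job[] = [];
    try {
      jobs = await opts.claim(free);
    } finally {
      claimInProgress = false;
    }

    for (const job of jobs) {
      inFlight++;
      opts
        .handle(job)
        .catch(() => {
          // Error reporting is elided in this sketch.
        })
        .finally(() => {
          // Free the slot and try to refill it right away,
          // instead of waiting for the rest of the batch.
          inFlight--;
          void pump();
        });
    }
  }

  return {
    pump,
    stop: () => {
      stopped = true;
    },
    inFlight: () => inFlight,
  };
}
```

In this sketch, a pump() call that arrives while a claim is already in progress returns without claiming, so a periodic poll would still be needed to pick up slots freed during that window.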

Other changes

None.

Key files to review

  • patches/@nicnocquee__dataqueue@1.39.0.patch — the complete fix applied to both CJS and ESM builds.
  • package.json: the pnpm.patchedDependencies wiring.

How to test

  1. Build hermes-worker: pnpm --filter hermes-worker build — should complete without errors.
  2. Confirm the fix is bundled: grep -c claimInProgress apps/hermes/worker/dist/index.js — should return 4.
  3. Behavioral check: a processor with concurrency=2, groupConcurrency=2, 1 slow job and 5 fast jobs (same group) should complete all 5 fast jobs while the slow one is still running. This mirrors the regression test in nicnocquee/dataqueue#41 ("Refill freed concurrency slots continuously instead of per batch"), which times out on the old code and passes on the patched build.
  4. To remove this patch later: bump @nicnocquee/dataqueue past 1.39.0 and delete both the patch file and the patchedDependencies entry.
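
The behavioral check in step 3 can be reasoned about with a small virtual-time model. The sketch below is not the upstream regression test. It is a simplified simulation that treats concurrency=2 and groupConcurrency=2 as a single limit of 2, ignores polling latency, and uses made-up durations (1000 for the slow job, 10 for each fast job). All function names are hypothetical.

```typescript
// Batch barrier: claim up to `concurrency` jobs, wait for all of them, repeat.
function simulateBatchBarrier(durations: number[], concurrency: number): number[] {
  const finishedAt: number[] = [];
  let now = 0;
  for (let i = 0; i < durations.length; i += concurrency) {
    const batch = durations.slice(i, i + concurrency);
    batch.forEach((d, k) => {
      finishedAt[i + k] = now + d;
    });
    // The next claim only happens once the slowest job in the batch settles.
    now += Math.max(...batch);
  }
  return finishedAt;
}

// Continuous pool: the next job starts on whichever slot frees up first.
function simulateContinuousPool(durations: number[], concurrency: number): number[] {
  const finishedAt: number[] = [];
  const slotFreeAt = new Array<number>(concurrency).fill(0);
  durations.forEach((d, i) => {
    let slot = 0;
    for (let s = 1; s < concurrency; s++) {
      if (slotFreeAt[s] < slotFreeAt[slot]) slot = s;
    }
    finishedAt[i] = slotFreeAt[slot] + d;
    slotFreeAt[slot] = finishedAt[i];
  });
  return finishedAt;
}

// Counts fast jobs (indexes 1..n) that finish before the slow job (index 0).
function fastDoneBeforeSlow(finishedAt: number[]): number {
  return finishedAt.slice(1).filter((t) => t < finishedAt[0]).length;
}

const durations = [1000, 10, 10, 10, 10, 10]; // 1 slow job, then 5 fast jobs
const barrier = simulateBatchBarrier(durations, 2);
const pool = simulateContinuousPool(durations, 2);

console.log(fastDoneBeforeSlow(barrier)); // 1
console.log(fastDoneBeforeSlow(pool)); // 5
```

Under these assumptions the batch-barrier model lets only one fast job finish before the slow job completes, while the continuous-pool model lets all five finish, which matches the outcome step 3 expects from the patched build.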

The Hermes data-plane processor stalled whenever a data-collection job ran
long. With groupConcurrency=3 the queue claimed same-group jobs in batches of
3 and waited for the whole batch to settle before claiming more, so two fast
jobs would finish and the third slow one blocked the entire pipeline group
until it timed out.

Vendor the upstream continuous-pool fix as a pnpm patch on 1.39.0 until a
fixed release is published. startInBackground now keeps up to `concurrency`
jobs in flight and refills each slot as it frees, never exceeding
groupConcurrency. Patches both dist/index.js (ESM) and dist/index.cjs (CJS).

Upstream PR: nicnocquee/dataqueue#41

Verified against a local Postgres: with one slow job plus five fast same-group
jobs under groupConcurrency=2, the patched build completes the fast jobs while
the slow one is still processing, and the unpatched build stalls (times out).

Both the hermes-dashboard and user-registration Dockerfiles explicitly
list what to COPY and had omitted the patches/ directory, so pnpm install
could not find patches/@nicnocquee__dataqueue@1.39.0.patch and failed
with ENOENT in CI.

Previously, data-collection awaited the entire round's fetch batch before
persisting any URL, so sources only reached the Agent Data API in a burst once
the slowest fetch in the round finished. Each URL is now persisted the moment
its own fetch and quality gates pass, via a new optional onOutcome hook on
performWebFetch that fires inside pMap per slot.

pnpm generates patch filenames like @scope__pkg@1.0.0.patch, which
cannot follow the kebab-case convention. Exclude the patches/ directory
from the new-file naming check the same way markdown files are.

kevinhermawan merged commit ec24a90 into main on Jun 12, 2026
20 checks passed
kevinhermawan deleted the fix/dataqueue-continuous-pool-patch branch on June 12, 2026 01:37