fix: stop holding pooled DB connections across embedder/LLM/reranker calls #2434

Draft
zommiommy wants to merge 2 commits into vectorize-io:main from zommiommy:fix/db-conn-not-held-across-embed-llm

@zommiommy zommiommy commented Jun 27, 2026

Problem

Several memory-engine paths held a pooled PostgreSQL connection checked out for the entire duration of a slow external call — an embedder, reranker, or LLM round-trip.

The pools are already bounded (asyncpg; each MemoryEngine pool defaults to max_size=20) and per-process: the API and the background worker each build their own MemoryEngine and pool. So this is not a leak or unbounded growth. The failure mode is saturation:

  • The background consolidation worker is the main source. _process_one_llm_batch held one connection for a whole batch — across the consolidation recall, the consolidation LLM call, per-action embeddings, and the dedup recall/embed/LLM. Under load, enough concurrent batches park the worker's entire pool on these multi-second calls; the poller and remaining batches then block until acquire_timeout. The worker starves itself.
  • Two API-side edit paths embedded while holding a connection: update_memory_unit re-embedded mid-transaction, and update_mental_model embedded inside the acquired connection.
  • Because the parked connections sit idle mid-call, this also inflates pressure on the server's global max_connections (the sum of every process's pool) for no useful work.

(The hot recall/rerank path was already clean: the query is embedded before the connection is acquired, and reranking runs after retrieval returns.)
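
The saturation effect can be reproduced without a database. The following is a minimal sketch, not code from this PR: `FakePool`, `batch_holding`, and `batch_releasing` are hypothetical names, the pool is modeled with an `asyncio.Semaphore`, and the slow external call is a 50 ms sleep.

```python
import asyncio
from contextlib import asynccontextmanager


class FakePool:
    """Hypothetical stand-in for a bounded connection pool (not asyncpg)."""

    def __init__(self, max_size):
        self._slots = asyncio.Semaphore(max_size)
        self.in_use = 0
        self.peak_in_use = 0

    @asynccontextmanager
    async def acquire(self):
        await self._slots.acquire()
        self.in_use += 1
        self.peak_in_use = max(self.peak_in_use, self.in_use)
        try:
            yield object()  # placeholder "connection"
        finally:
            self.in_use -= 1
            self._slots.release()


async def slow_external_call():
    await asyncio.sleep(0.05)  # stands in for an embedder/LLM round-trip


async def batch_holding(pool):
    async with pool.acquire():
        await asyncio.sleep(0)      # SQL read
        await slow_external_call()  # connection is parked here
        await asyncio.sleep(0)      # SQL write


async def batch_releasing(pool):
    async with pool.acquire():
        await asyncio.sleep(0)      # SQL read
    await slow_external_call()      # no connection held
    async with pool.acquire():
        await asyncio.sleep(0)      # SQL write


async def run(batch, n_batches=8, max_size=2):
    pool = FakePool(max_size)
    loop = asyncio.get_running_loop()
    start = loop.time()
    await asyncio.gather(*(batch(pool) for _ in range(n_batches)))
    return loop.time() - start, pool.peak_in_use


held_time, held_peak = asyncio.run(run(batch_holding))
free_time, free_peak = asyncio.run(run(batch_releasing))
print(f"holding:   {held_time:.2f}s, peak connections {held_peak}")
print(f"releasing: {free_time:.2f}s, peak connections {free_peak}")
```

With eight batches and a pool of two, the holding variant has to run the slow calls in waves of two, while the releasing variant overlaps all eight slow calls and only touches the pool for the brief SQL steps.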

Why a bounded / bulkhead pool wasn't enough

The obvious smaller fix — bound or bulkhead the pool — doesn't address the real failure:

  • The pools are already bounded; capping them tighter just makes saturation arrive sooner, and a slow call still parks its connection for its full duration.
  • A dedicated consolidation sub-pool (bulkhead) was considered. It would stop batch work from starving the worker's other duties, but it only self-throttles consolidation: slow LLMs still park the sub-pool, so throughput is capped at sub-pool size. It also leaves the API-side update_memory_unit / update_mental_model hold-across-embed paths untouched.

To keep consolidation throughput decoupled from connection count, and to shrink the connection footprint everywhere, the connection has to be released across the slow call. That's what this PR does.

Approach

Invariant enforced across the changed paths: no pooled DB connection is held across an embedder / reranker / LLM await.

  • Connections are acquired short-lived, only around SQL. Embeddings, the LLM call, and dedup adjudication run with no connection held.
  • Consolidation: action executors and dedup helpers take the backend (pool) and self-acquire a short connection around their SQL. Each source-liveness check is paired with its write in one tiny transaction (on PostgreSQL this replaces autocommit-per-statement with an atomic check-then-write, and it normalizes the behavior on Oracle). Dedup folds are RETURNING-gated and re-filter live sources inside the fold transaction (FOR SHARE, sources-before-observation lock order, matching the normal write paths), so a twin or source deleted during the now connection-free window can't drop a CREATE or fold a dead source id.
  • update_memory_unit: split into a read/resolve + embed phase (off-connection, into typed _MemoryEditPlan / _MemoryRevertPlan) and a short write transaction that re-locks the row (FOR UPDATE), filters surviving entities, and applies the precomputed embedding — aborting (rollback) if a concurrent edit changed an embedding-input column between the phases.
  • update_mental_model: embeds before acquiring (its text depends only on its arguments).
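
The two-phase shape of update_memory_unit can be illustrated with an in-memory model. This is a sketch under stated assumptions, not the PR's implementation: `EditPlan`, `FakeStore`, `plan_edit`, `apply_edit`, and `ConcurrentEditError` are hypothetical names, an `asyncio.Lock` stands in for `SELECT ... FOR UPDATE`, and the only embedding-input column modeled is `text`.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class EditPlan:
    """Hypothetical stand-in for the PR's typed plan objects."""
    unit_id: str
    base_text: str        # embedding-input column as read in phase 1
    new_text: str
    new_embedding: list


class ConcurrentEditError(RuntimeError):
    """Raised when the row changed between the two phases."""


class FakeStore:
    """In-memory table; the lock stands in for SELECT ... FOR UPDATE."""

    def __init__(self):
        self.rows = {"u1": {"text": "old", "embedding": [3.0]}}
        self.row_lock = asyncio.Lock()


async def embed(text):
    await asyncio.sleep(0.01)  # slow external call, no lock held
    return [float(len(text))]


async def plan_edit(store, unit_id, new_text):
    # Phase 1: short read, then embed with nothing held.
    base_text = store.rows[unit_id]["text"]
    embedding = await embed(new_text)
    return EditPlan(unit_id, base_text, new_text, embedding)


async def apply_edit(store, plan):
    # Phase 2: short write "transaction" that re-locks and re-checks the row.
    async with store.row_lock:
        row = store.rows[plan.unit_id]
        if row["text"] != plan.base_text:
            raise ConcurrentEditError(plan.unit_id)  # abort, nothing written
        row["text"] = plan.new_text
        row["embedding"] = plan.new_embedding


async def demo():
    store = FakeStore()
    plan_a = await plan_edit(store, "u1", "first edit")
    plan_b = await plan_edit(store, "u1", "second edit")  # same snapshot as plan_a
    await apply_edit(store, plan_a)
    try:
        await apply_edit(store, plan_b)
        outcome = "applied"
    except ConcurrentEditError:
        outcome = "aborted"
    return store.rows["u1"], outcome


final_row, outcome = asyncio.run(demo())
print(final_row, outcome)
```

The second plan was built from a snapshot that the first edit has since invalidated, so its write phase aborts instead of storing an embedding that no longer matches the row.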

Why this is a large change

Moving the embed/LLM out of the held connection widens the read→write window, so this isn't a one-line "release the connection." The held transaction used to provide serialization for free; replacing it requires new guards to preserve the same correctness:

  • atomic check-then-write transactions,
  • RETURNING-gated, live-source-filtered dedup folds with a consistent (sources-first) lock order,
  • a two-phase update_memory_unit with re-lock, abort/retry on inter-phase edits, and orphan-entity reclaim.
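
As an illustration of the fold guards, the sketch below models one fold transaction in plain Python. It assumes, as one plausible reading of the description above, that a fold whose twin has disappeared falls back to a CREATE. All names (`dedup_fold`, `update_returning`, `create_observation`) and the table shapes are hypothetical.

```python
from typing import Optional

# In-memory stand-ins for the tables (hypothetical shapes).
live_memories = {"m1", "m2"}                   # m3 was deleted mid-window
observations = {"obs-1": {"sources": {"m1"}}}


def update_returning(obs_id: str, extra_sources: set) -> Optional[str]:
    """Models `UPDATE ... WHERE id = :id RETURNING id`: None if no row matched."""
    row = observations.get(obs_id)
    if row is None:
        return None
    row["sources"] |= extra_sources
    return obs_id


def create_observation(sources: set) -> str:
    obs_id = f"obs-{len(observations) + 1}"
    observations[obs_id] = {"sources": set(sources)}
    return obs_id


def dedup_fold(twin_id: str, candidate_sources: set) -> str:
    """Body of one fold transaction; returns the action actually taken."""
    # 1. Sources first: keep only source memories that are still alive.
    live = candidate_sources & live_memories
    if not live:
        return "skipped"
    # 2. Then the observation: the write only counts if a row came back.
    if update_returning(twin_id, live) is not None:
        return "updated"
    # 3. Assumed fallback: the twin vanished, so create instead of dropping.
    create_observation(live)
    return "created"


results = [
    dedup_fold("obs-1", {"m2", "m3"}),  # dead m3 is filtered out
    dedup_fold("obs-gone", {"m2"}),     # twin deleted, falls back to CREATE
    dedup_fold("obs-1", {"m3"}),        # every source dead, nothing written
]
print(results)
```

In real SQL the liveness filter and the gated write would run inside a single transaction with the row locks described above; the sketch only shows the decision order and why the returned action has to be propagated to the caller.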

That's why the diff is substantial (~1.5k lines), concentrated in the two core engine modules and their test suites.

One deliberate exception

A few rare recovery paths re-embed inside the Phase-2 transaction: when a concurrent graph-maintenance prune (or a same-unit entity-only edit) lands between the phases, update_memory_unit / revert re-embed under the lock so the stored vector stays consistent with unit_entities. These are bounded to a single re-embed and pinned by tests. There is no retry wrapper around update_memory_unit (a RuntimeError surfaces as HTTP 500), so converting them to abort/retry would regress interactive edits — the bounded in-txn re-embed is the intentional fallback. It can be converted to strict abort/retry later if a "never embed in-transaction" guarantee is preferred.
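
The bounded fallback can be pictured as follows. This is a rough sketch with hypothetical names (`Plan`, `embed`, `live_entities`), assuming the embedding depends on both the text and the linked entity set. When a prune removes an entity between the phases, the write phase re-embeds once under the lock instead of aborting.

```python
import asyncio
from dataclasses import dataclass

embed_calls = 0


async def embed(text, entities):
    """Fake embedder: the vector depends on the text and the linked entities."""
    global embed_calls
    embed_calls += 1
    await asyncio.sleep(0)
    return (len(text), len(entities))


@dataclass(frozen=True)
class Plan:  # hypothetical; not the PR's plan type
    text: str
    entities: frozenset
    embedding: tuple


async def demo():
    live_entities = {"alice", "bob"}
    row = {"text": "", "entities": frozenset(), "embedding": ()}
    lock = asyncio.Lock()  # stands in for the row lock

    # Phase 1, no lock held.
    text = "met alice and bob"
    wanted = frozenset({"alice", "bob"})
    plan = Plan(text, wanted, await embed(text, wanted))

    live_entities.discard("bob")  # a prune lands between the phases

    # Phase 2, under the lock.
    async with lock:
        surviving = plan.entities & live_entities
        embedding = plan.embedding
        if surviving != plan.entities:
            # Single in-lock re-embed, no loop: the one bounded exception.
            embedding = await embed(plan.text, surviving)
        row.update(text=plan.text, entities=surviving, embedding=embedding)
    return row


row = asyncio.run(demo())
print(row, embed_calls)
```

The stored vector ends up matching the surviving entity set at the cost of one embedder call inside the lock, which is the trade-off this section describes.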

Verification

  • ruff + ty clean (changed files).
  • Deterministic no-DB unit suites pass — they exercise the dedup decision and fold guards (RETURNING-gating, live-source filtering, created/skipped propagation) and the pre-embed guards directly.
  • Real PostgreSQL (pgvector pg18): the consolidation, memory-curation, observation-invalidation, mental-model (mock-LLM), and document-transfer suites pass against a live DB. The only non-passes are real-LLM tests that require provider credentials, unrelated to this change.

@zommiommy zommiommy marked this pull request as draft June 27, 2026 10:05
@zommiommy zommiommy force-pushed the fix/db-conn-not-held-across-embed-llm branch from 3691864 to 91c4e30 Compare June 27, 2026 11:09
…LLM calls

The background consolidation worker held a single pooled PostgreSQL
connection for an entire LLM batch -- across the consolidation recall, the
consolidation LLM call, per-action embeddings, and the dedup
recall/embed/LLM. Each MemoryEngine pool is bounded (default max 20), so
under load these multi-second external calls park the worker's whole pool
and the poller and remaining batches block on connection acquisition: the
worker saturates its own pool. Idle-but-checked-out connections also add
needless pressure on the server's global max_connections.

Acquire connections short-lived, only around SQL; run embeddings, the LLM
call, and dedup adjudication with no connection held. The action executors
and dedup helpers now take the backend (pool) and self-acquire around their
own writes.

Moving the slow calls off the connection widens the decision-to-write
window, so the held-transaction serialization is replaced with explicit
guards that preserve the same correctness:

- Each source-liveness check is paired with its write in one short
  transaction (atomic; previously autocommit-per-statement on PostgreSQL,
  normalized on Oracle).
- Dedup CREATE/UPDATE folds are RETURNING-gated and re-filter live source
  memories inside the fold transaction (FOR SHARE, sources-before-
  observation lock order, matching _create_observation_directly and
  _execute_update_action), so a twin or source deleted during the now
  connection-free window cannot drop a CREATE or fold a dead source id.
- Skip embedding entirely when every source memory is already dead.
- _execute_create_action reports its action so _process_memory_batch counts
  a memory as created only when an observation was actually written.

The deterministic tests need no DB or LLM and pin the fold guards
(RETURNING gate, live-source filtering, created/skipped propagation) and
the pre-embed short-circuits directly.

update_memory_unit re-embedded mid-transaction and update_mental_model
embedded inside the acquired connection, so an interactive edit held a
pooled connection across the embedder call.

update_mental_model now embeds before acquiring (its text depends only on
its arguments).

update_memory_unit is split into two phases:

- Phase 1 reads, resolves, and embeds off any connection, into typed
  _MemoryEditPlan / _MemoryRevertPlan.
- Phase 2 is a short write transaction that re-locks the row (FOR UPDATE),
  filters surviving entities, and applies the precomputed embedding.

The wider read-to-write window is covered by explicit guards:

- Abort (rollback and retry) if a concurrent edit changed an embedding- or
  resolution-input column between the phases.
- Reclaim orphan entities when an entity edit fails or is skipped, and
  re-lock the resolved set FOR UPDATE before linking to avoid an FK race
  with concurrent graph-maintenance pruning.

One bounded exception remains by design: when a concurrent prune (or a
same-unit entity-only edit) lands between the phases, Phase 2 and the
revert path re-embed once under the lock so the stored vector stays
consistent with unit_entities. There is no retry wrapper around
update_memory_unit, so this is preferred over surfacing a transient error,
and it is pinned by the curation tests.

Tests cover the two-phase edit, revert, concurrent-edit abort, orphan
reclaim, the prune-recovery re-embeds, observation invalidation, and
document transfer under the new acquisition model.
@zommiommy zommiommy force-pushed the fix/db-conn-not-held-across-embed-llm branch from 91c4e30 to 7ac205a Compare June 27, 2026 14:56