1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -19,6 +19,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Infra: agent Lambda memory bumped from 1024 MB to 3008 MB** in `infrastructure/environments/dev/lambda.tf`. CloudWatch on the already-deployed `#128` image showed cold-start at ~27 seconds — Lambda's 10 s init cap tripped, the container was force-restarted, and LWA's `/health` probe only went green ~28 s in. That's outside CloudFront's 30 s origin-response window, so browser-side cold hits presented as `FPL is unreachable` on `/team` and `Stream ended before a final report` on `/chat`. The 27 s was dominated by Python imports (LangGraph + pgvector + asyncpg + langfuse + anthropic + FastAPI) on a 1024 MB Lambda's ~0.57 vCPU; Lambda scales CPU linearly with memory, so 3008 MB (~1.7 vCPU) cuts that roughly threefold into the 8–12 s range. Max observed memory usage stays under 300 MB — the knob is bought for CPU, not RAM.
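The memory-to-CPU arithmetic above can be sanity-checked with a back-of-envelope sketch. It assumes AWS's documented ratio of roughly one full vCPU at 1,769 MB; the helper names are illustrative, and the linear-scaling model ignores I/O and per-function CPU caps:

```python
# Lambda allocates CPU in proportion to memory: roughly one full vCPU at
# 1,769 MB (documented AWS ratio). Helper names are illustrative.
def lambda_vcpus(memory_mb: int) -> float:
    return memory_mb / 1769.0

def scaled_import_time(baseline_s: float, old_mb: int, new_mb: int) -> float:
    # CPU-bound import work scales roughly inversely with allocated vCPU.
    # This ignores I/O and the per-function CPU cap, so treat it as a rough
    # estimate, not a prediction.
    return baseline_s * lambda_vcpus(old_mb) / lambda_vcpus(new_mb)

print(round(lambda_vcpus(1024), 2))                    # ~0.58 vCPU
print(round(lambda_vcpus(3008), 2))                    # ~1.7 vCPU
print(round(scaled_import_time(27.0, 1024, 3008), 1))  # ~9.2 s
```

Linear scaling predicts ~9 s, consistent with the 8–12 s range quoted above once per-request jitter is included.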

### Fixed
- **Agent: Langfuse flush no longer hangs the response thread — correct env vars + explicit-flush removal.** Follow-up to #137 after the re-enablement caused `Task timed out after 60 seconds` on every `/team` request post-deploy, forcing an emergency kill-switch flip (`LANGFUSE_TRACING_ENABLED=false` via CLI). Root cause of the hang and the fix are now documented in the expanded ADR-0005 revision — two independent blocking surfaces in Langfuse 4.x: (1) `OTEL_BSP_EXPORT_TIMEOUT` + `OTEL_EXPORTER_OTLP_TIMEOUT` are wired but ignored by OpenTelemetry's OTLP HTTP exporter (`# Not used. No way currently to pass timeout to export.` — `opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py:164`) — the actual bound on the retry loop is Langfuse's own `LANGFUSE_TIMEOUT` env var (seconds, passed straight into the exporter constructor); (2) `resource_manager.flush()` calls `_score_ingestion_queue.join()` with no timeout, so if we emit quality scores (`_emit_quality_scores` in `api.py`) and Langfuse is slow, the handler thread blocks indefinitely. Additionally `LANGFUSE_FLUSH_AT=1` was counterproductive — it sets `max_export_batch_size=1`, which makes `BatchSpanProcessor.on_end()` trigger synchronous export on the handler thread for every single span. Fix in `infrastructure/environments/dev/lambda.tf`: set `LANGFUSE_TIMEOUT="2"` (caps the retry loop at 2s) and `LANGFUSE_FLUSH_INTERVAL="1"` (background flush every 1s), remove the ignored `OTEL_*` env vars, remove `LANGFUSE_FLUSH_AT`. Fix in `services/agent/src/fpl_agent/api.py`: remove the three explicit `langfuse_flush()` calls from `/chat` and `/chat/sync` handlers — the background batch processor drains spans and scores on its own thread, and the `_score_ingestion_queue.join()` is what was hanging indefinitely. 
`LANGFUSE_TRACING_ENABLED` stays at `"false"` in Terraform as a deliberate kill-switch — we flip it back on via CLI after apply once production latency is verified, promoting to `"true"` in Terraform only after that. Research went to the SDK source (langfuse-python v4.0.6) to pin down exactly which env vars each blocking surface responds to.
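The second blocking surface, an unbounded `queue.join()`, is easy to reproduce and guard against in plain Python. This is a minimal sketch, not Langfuse code: `join_with_timeout` polls the queue's `unfinished_tasks` counter (a CPython implementation detail) against a deadline instead of blocking forever:

```python
import queue
import time

def join_with_timeout(q: queue.Queue, timeout: float) -> bool:
    # Poll with a deadline instead of q.join(), which blocks forever when a
    # consumer stalls. unfinished_tasks is a CPython implementation detail.
    deadline = time.monotonic() + timeout
    while q.unfinished_tasks:
        if time.monotonic() >= deadline:
            return False
        time.sleep(0.05)
    return True

q = queue.Queue()
q.put("score-event")          # enqueued, but no consumer ever calls
                              # task_done(): models a stalled upload thread
drained = join_with_timeout(q, timeout=0.2)
print(drained)                # → False, after ~0.2s instead of forever
```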
- **Agent: Langfuse tracing re-enabled with tight OTLP timeouts + module-scope client.** Three changes across `libs/fpl_lib/observability.py` and `infrastructure/environments/dev/lambda.tf`: (1) the `Langfuse()` client is now constructed lazily *once per cold-start* and cached at module scope — Langfuse maintainers flag per-call construction as a Lambda anti-pattern because it rebuilds the OTEL tracer provider ([discussion #7669](https://github.com/orgs/langfuse/discussions/7669)); (2) `_get_client()` short-circuits to `None` when `LANGFUSE_TRACING_ENABLED=false`, preserving the kill-switch; (3) Terraform pins `OTEL_EXPORTER_OTLP_TIMEOUT=3000`, `OTEL_BSP_EXPORT_TIMEOUT=5000`, `LANGFUSE_FLUSH_AT=1`, `LANGFUSE_FLUSH_INTERVAL=1000` so a slow or unreachable Langfuse endpoint adds at most ~5s to the response (vs. the 60s default that blew up `/team` in #133). This matches the pattern Langfuse's own docs recommend for low-traffic Lambda deployments and is what almost every real-world Lambda + Langfuse deployment on GitHub uses; ADOT Lambda Extension remains the production answer at higher RPS and is parked as a follow-up pending observed trace-drop rate. Four new unit tests pin the module-scope + kill-switch semantics (`test_client_is_constructed_once_across_calls`, `test_flush_short_circuits_when_tracing_disabled`, `test_record_llm_usage_short_circuits_when_tracing_disabled`, `test_client_construction_failure_is_swallowed`). Rationale + ADOT comparison documented in ADR-0005 revision (2026-04-20).
- **Dashboard: step pills no longer ghost-tick ahead of the actual graph position** in `web/dashboard/src/pages/chat/StepPills.tsx`. The agent graph has a conditional edge — `reflector` can route back to `planner` for another round when it decides more data is needed — so for any query that triggers a second iteration, the stream emits `planner → tool_executor → reflector → planner → tool_executor → …`. The previous renderer built a permanent `seen: Set<string>` from the steps array, so once *any* event for a node arrived, that pill stayed ticked forever. On loop-back the user saw "Reviewing ✓" next to "Gathering data" *spinning*, which broke the linear mental model the pills implicitly promise. Switched to position-based rendering: `currentIdx = NODE_ORDER.indexOf(lastStep)`; pills before `currentIdx` are done, the pill *at* `currentIdx` is in-flight, pills after are pending. When the graph jumps back to an earlier step, the later pills cleanly revert to the pending dot. Once the stream ends (`active = false`), everything ≤ `currentIdx` counts as done so the final state still reads as "all four completed". Four unit tests (new `StepPills.test.tsx`) pin the behaviour, including a regression case that asserts `[planner, tool_executor, reflector, planner]` renders Planning spinning + the rest pending.
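The real component is TypeScript (`StepPills.tsx`); a Python sketch of the position-based selector makes the loop-back behaviour easy to see. The fourth node name `reporter` and the state labels are illustrative, not taken from the repo:

```python
NODE_ORDER = ["planner", "tool_executor", "reflector", "reporter"]

def pill_states(steps: list[str], active: bool) -> list[str]:
    # Derive state from the *last* step seen, not from the set of all steps,
    # so a loop back to an earlier node reverts later pills to pending.
    if not steps:
        return ["pending"] * len(NODE_ORDER)
    current = NODE_ORDER.index(steps[-1])
    states = []
    for i, _ in enumerate(NODE_ORDER):
        if i < current:
            states.append("done")
        elif i == current:
            # While streaming, the current pill spins; once the stream ends,
            # everything up to and including it reads as done.
            states.append("active" if active else "done")
        else:
            states.append("pending")
    return states

# The loop-back regression case from the entry above:
print(pill_states(["planner", "tool_executor", "reflector", "planner"], active=True))
# → ['active', 'pending', 'pending', 'pending']
```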
- **Dashboard: SSE parser now splits on any blank-line variant (`\r\n\r\n`, `\n\n`, or `\r\r`) and flushes the trailing buffer when the stream closes.** Fix in `web/dashboard/src/lib/agentApi.ts`. Every SSE frame the agent emits over CloudFront → Lambda Function URL ends with `\r\n\r\n` (verified via hex dump of the live stream), but the parser's `buffer.indexOf("\n\n")` never matched — `\r\n\r\n` is `\r`,`\n`,`\r`,`\n` with no adjacent LFs. The parser found *zero* frames, the `for await` loop exited without yielding, and the UI hit the `settled === false` branch and rendered "Stream ended before a final report." for every chat. Fix normalises every chunk to LF-only line endings before the frame splitter (`s.replace(/\r\n?/g, "\n")`), and flushes whatever remains in the buffer as a final frame when the reader reports `done` — some proxies close the connection without a trailing blank line, which would otherwise silently drop the terminal `result` event. Two regression tests pin both behaviours (`parses frames separated by CRLF CRLF` + `flushes a trailing frame that closes without a blank line`). Latent since #117 — masked by the Function URL 403 loop (#123 → #132) so no browser request ever reached the parser's happy path until today.
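The fix lives in TypeScript (`agentApi.ts`); here is the same normalise-then-split logic as a hedged Python sketch (function name illustrative):

```python
import re

def iter_sse_frames(chunks):
    # Normalise CRLF / lone CR to LF before splitting: the four bytes of
    # CRLF CRLF contain no *adjacent* LFs, so a naive search for "\n\n" on
    # the raw text finds zero frame boundaries.
    buf = ""
    for chunk in chunks:
        buf += re.sub(r"\r\n?", "\n", chunk)
        while (i := buf.find("\n\n")) != -1:
            frame, buf = buf[:i], buf[i + 2:]
            if frame:
                yield frame
    # Flush whatever is left when the stream closes: some proxies drop the
    # connection without a trailing blank line.
    if buf.strip():
        yield buf.strip()

frames = list(iter_sse_frames(["event: step\r\ndata: planner\r\n\r\nevent: result\r\ndata: done"]))
print(frames)
# → ['event: step\ndata: planner', 'event: result\ndata: done']
```

Normalising first also covers bare-CR servers, and the trailing flush keeps the terminal `result` event when a proxy closes without a blank line.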
31 changes: 26 additions & 5 deletions docs/adr/0005-prompt-versioning-and-llm-observability.md
@@ -110,11 +110,32 @@ Prompt version metadata flows into every Langfuse trace, enabling A/B comparison

## Revision — 2026-04-20: Langfuse SDK v4 on Lambda tuning

Langfuse SDK v4 is an OpenTelemetry rewrite. On Lambda, the OTLP batch exporter's default timeouts (30s export + 2× 10s HTTP retries) stacked to a ~60s hang on the response thread whenever `cloud.langfuse.com` was slow or unreachable — `/team` returned HTTP 200 but took 60s to return (see CHANGELOG 2026-04-20). Root cause: Lambda freezes the execution environment on handler return, so any `flush()` call on the response thread pays the full export deadline inline.
Langfuse SDK v4 is an OpenTelemetry rewrite. On Lambda, uploads to `cloud.langfuse.com` can stack to a 60s hang on the response thread. There are **two separate blocking surfaces** and both matter:

Two accepted patterns:
1. **The OTLP HTTP exporter's retry loop.** `OTLPSpanExporter.export()` retries up to 6 times with exponential backoff (1s → 2s → 4s → 8s → 16s → 32s ≈ 63s total) on connect errors. Bounded only by the exporter's own `timeout` argument — OTel's standard `OTEL_BSP_EXPORT_TIMEOUT` is wired but [ignored by design](https://github.com/open-telemetry/opentelemetry-python) (the HTTP exporter has the explicit comment `# Not used. No way currently to pass timeout to export.`). Langfuse 4.x passes its own `LANGFUSE_TIMEOUT` env var (seconds, default 5) straight into the exporter constructor; that's the only knob that caps the retry deadline.
2. **`resource_manager.flush()`'s queue joins.** `Langfuse().flush()` calls `self.tracer_provider.force_flush()` (OTel default 30 s, and the passed `export_timeout_millis` is silently ignored), then `_score_ingestion_queue.join()` and `_media_upload_queue.join()` **with no timeout at all**. If we emit any scores (we do — `_emit_quality_scores` in `api.py`) and the Langfuse endpoint is slow, the handler thread blocks forever on `.join()`.
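The backoff arithmetic in surface (1) is worth making concrete. A simplified model of the deadline check follows; the real loop in the OTLP HTTP exporter also spends time on the HTTP attempts themselves, which this ignores:

```python
def retry_cost(timeout_s: float, max_retries: int = 6) -> float:
    # Worst-case wall-clock spent sleeping between retries: back off
    # 1s, 2s, 4s, ... but stop once the next backoff would overrun the
    # deadline, mirroring the deadline check described above (simplified).
    spent = 0.0
    for attempt in range(max_retries):
        backoff = 2 ** attempt  # 1, 2, 4, 8, 16, 32
        if spent + backoff > timeout_s:
            break
        spent += backoff
    return spent

print(retry_cost(64.0))  # → 63.0 (all six backoffs: 1+2+4+8+16+32)
print(retry_cost(2.0))   # → 1.0  (LANGFUSE_TIMEOUT=2 allows one 1s backoff)
```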

1. **SDK-only config (chosen).** Module-scope `Langfuse()` client (constructed lazily on first use and cached for the warm container's lifetime) plus tight OTEL env vars — `OTEL_EXPORTER_OTLP_TIMEOUT=3000`, `OTEL_BSP_EXPORT_TIMEOUT=5000`, `LANGFUSE_FLUSH_AT=1`. Worst-case per-request latency cost: ~5s on cold start when Langfuse is unreachable. At our scale (1–2 rpm hobby traffic) this is the recommended pattern — Langfuse maintainers reiterate it in [discussion #7669](https://github.com/orgs/langfuse/discussions/7669) and it's what almost every real-world Lambda + Langfuse deployment uses.
2. **ADOT Lambda Extension.** Runs a local OpenTelemetry Collector as an external extension process. The app exports to `localhost:4318`; the collector buffers and flushes during the extension lifecycle (after the handler returns but before the container freezes). User-facing request latency is untouched. Widely regarded as overkill below double-digit RPS, and adds Dockerfile complexity (the collector has to be baked into the container image — Layers don't work with container-image Lambdas).
### What we tried first and why it failed

Pattern 2 is the production answer if (a) trace-drop rate becomes visible in the Langfuse UI, or (b) the ~5s worst-case cold-start tax shows up as user-visible latency. Pattern 1 remains the default. `LANGFUSE_TRACING_ENABLED=false` is preserved as a kill-switch — a single `aws lambda update-function-configuration` call disables tracing in seconds if a future Langfuse outage overruns the 5s cap.
The first correction set `OTEL_EXPORTER_OTLP_TIMEOUT=3000` and `OTEL_BSP_EXPORT_TIMEOUT=5000` — both ignored by Langfuse's own processor, so `/team` hung 60s after redeploy. We also set `LANGFUSE_FLUSH_AT=1` thinking it would reduce queue depth; it actually *made things worse* because `BatchSpanProcessor.on_end()` triggers synchronous export on the handler thread whenever `queue_size >= max_export_batch_size`. With `max=1`, every span paid the full retry loop inline.
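Why `LANGFUSE_FLUSH_AT=1` backfires can be shown with a toy model of the batch processor. This is a deliberately simplified stand-in that models the behaviour described above, not OTel's actual `BatchSpanProcessor` implementation:

```python
class TinyBatchProcessor:
    # Toy model: when the queue reaches max_export_batch_size, export runs
    # on the caller's thread. With max_export_batch_size=1 (what
    # LANGFUSE_FLUSH_AT=1 sets), that is every single span, so the upload
    # cost lands on the request handler.
    def __init__(self, export, max_export_batch_size: int):
        self.queue = []
        self.export = export
        self.max = max_export_batch_size

    def on_end(self, span) -> None:
        self.queue.append(span)
        if len(self.queue) >= self.max:  # inline, synchronous path
            batch, self.queue = self.queue, []
            self.export(batch)

calls = []
p = TinyBatchProcessor(calls.append, max_export_batch_size=1)
for s in ("span-a", "span-b", "span-c"):
    p.on_end(s)
print(len(calls))  # → 3: one blocking export per span
```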

### What actually works (chosen)

- **`LANGFUSE_TIMEOUT=2`** — seconds, caps the OTLP exporter's retry loop (per-POST socket timeout AND overall deadline). A dead endpoint now costs ~2s, not 60s.
- **`LANGFUSE_FLUSH_INTERVAL=1`** — short background flush interval so the batch processor drains on its own thread between requests. Small tail of events may be stranded when the Lambda freezes mid-batch; acceptable at our scale.
- **Do NOT set `LANGFUSE_FLUSH_AT=1`** (see failure mode above — leave the SDK default of 15).
- **Remove all explicit `langfuse_flush()` calls from request handlers.** The `_score_ingestion_queue.join()` path in `resource_manager.flush()` is unbounded and the background processor covers the common case anyway.
- **Keep `LANGFUSE_TRACING_ENABLED` as the kill-switch** — a single `aws lambda update-function-configuration` flip disables tracing in seconds.
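The chosen shape, drain on a background thread and keep the request path O(1), looks roughly like this. A generic sketch, not the Langfuse SDK's internals:

```python
import queue
import threading
import time

class BackgroundFlusher:
    # Events are drained on a daemon thread every flush_interval seconds, so
    # the request path never blocks on the upload. The trade-off from the
    # ADR applies: events still queued when the process freezes are lost.
    def __init__(self, upload, flush_interval: float = 1.0):
        self._q: queue.Queue = queue.Queue()
        self._upload = upload
        self._interval = flush_interval
        threading.Thread(target=self._run, daemon=True).start()

    def emit(self, event) -> None:  # called on the request path: O(1)
        self._q.put(event)

    def _run(self) -> None:
        while True:
            time.sleep(self._interval)
            batch = []
            while not self._q.empty():
                batch.append(self._q.get_nowait())
            if batch:
                self._upload(batch)  # slow network I/O happens here,
                                     # off the handler thread

sent = []
flusher = BackgroundFlusher(sent.append, flush_interval=0.1)
flusher.emit({"name": "span-1"})
time.sleep(0.3)  # give the daemon thread a cycle to drain
print(sent)
```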

### Source citations

- `langfuse/_client/client.py:269` — `timeout = timeout or int(os.environ.get(LANGFUSE_TIMEOUT, 5))`
- `langfuse/_client/span_processor.py:108-112` — `OTLPSpanExporter(timeout=timeout)`
- `langfuse/_client/resource_manager.py:430` — `flush()` call chain; `_score_ingestion_queue.join()` without timeout
- `opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py:174-224` — retry loop with `_MAX_RETRYS = 6`, deadline = `time() + self._timeout`

### ADOT stays deferred

[AWS Distro for OpenTelemetry](https://aws-otel.github.io/) as a Lambda Extension runs a local OTEL collector as an external extension process — the app exports to `localhost:4318` and the extension flushes during the post-invocation lifecycle (before freeze), so user latency is untouched. It's the production answer at higher RPS and sidesteps every SDK-level blocking surface above. Deferred for us because (a) the SDK-level fix above is what Langfuse maintainers recommend for low-traffic Lambda, (b) ADOT on container-image Lambdas (Layers don't work) requires Dockerfile plumbing, and (c) once `LANGFUSE_TIMEOUT=2` is in place, worst case is a ~2s tax which is invisible at 1–2 rpm. Revisit if the Langfuse UI shows visible trace drop or if cold-start latency becomes user-facing.

Langfuse PR [#1618](https://github.com/langfuse/langfuse-python/pull/1618) (v4.1+) adds a `span_exporter=` kwarg to `Langfuse()` letting apps inject a custom fire-and-forget or ADOT-routing exporter — cleaner escape hatch than overriding env vars. Worth revisiting after Langfuse 4.1 ships.
50 changes: 31 additions & 19 deletions infrastructure/environments/dev/lambda.tf
@@ -228,25 +228,37 @@ module "lambda_agent" {
SQUAD_CACHE_TABLE = aws_dynamodb_table.squad_cache.name
TEAM_FETCHER_FUNCTION_NAME = module.lambda_team_fetcher.function_name

# Langfuse tracing is on with the SDK pinned to tight timeouts so a slow
# or unreachable Langfuse endpoint adds at most ~5s to the response path
# (vs. the 60s default that blew up /team in #133). Matches the pattern
# Langfuse maintainers recommend for low-traffic Lambda deployments —
# ADOT Lambda Extension is the production answer at higher RPS and stays
# deferred pending observed trace-drop rate (see ADR-0005 revision).
LANGFUSE_TRACING_ENABLED = "true"
# OTLP exporter: give up on any single HTTP upload to cloud.langfuse.com
# after 3s (default 10s). Below the 5s BSP cap so one retry can still fit.
OTEL_EXPORTER_OTLP_TIMEOUT = "3000"
# BatchSpanProcessor: hard ceiling on the entire flush including retries.
# Lambda freezes the process the instant the handler returns, so any
# flush() call on the response thread blocks up to this value.
OTEL_BSP_EXPORT_TIMEOUT = "5000"
# Emit each span immediately — we'd rather pay 1 flush per span at low
# traffic than batch spans that could be stranded in the queue when the
# container freezes. At 1-2 rpm the upload cost is negligible.
LANGFUSE_FLUSH_AT = "1"
LANGFUSE_FLUSH_INTERVAL = "1000"
# Langfuse tracing — off by default; the kill-switch is preserved so a
# single CLI flip can disable tracing in seconds if a future Langfuse
# outage overruns the 2s timeout we pin below. We earned this caution:
# #137's first attempt re-enabled tracing with OTEL_BSP_EXPORT_TIMEOUT +
# OTEL_EXPORTER_OTLP_TIMEOUT, which are documented-ignored by
# OpenTelemetry's OTLP HTTP exporter (`# Not used. No way currently to
# pass timeout to export.` — opentelemetry/exporter/otlp/proto/http/
# trace_exporter/__init__.py:164) and by Langfuse 4.x's own
# LangfuseSpanProcessor. Tracing is flipped back on at the CLI after this
# applies —
# once we've verified LANGFUSE_TIMEOUT actually bounds the retry loop in
# production, the default here moves to "true".
LANGFUSE_TRACING_ENABLED = "false"

# LANGFUSE_TIMEOUT is passed straight to OTLPSpanExporter(timeout=…). It
# bounds both the per-POST socket timeout AND the retry loop's deadline
# (the OTel exporter aborts further retries once backoff_seconds >
# deadline_sec - time()). Default is 5 seconds; we pin 2 so a dead
# endpoint can never cost more than ~2s on the response thread.
# See langfuse/_client/client.py:269 + _client/span_processor.py:108.
LANGFUSE_TIMEOUT = "2"

# LANGFUSE_FLUSH_INTERVAL controls the background batch-processor's
# schedule delay (seconds). Setting it low keeps queue depth small so a
# Lambda freeze between requests loses few spans. We do NOT set
# LANGFUSE_FLUSH_AT=1: that puts max_export_batch_size=1 on the
# BatchSpanProcessor, which makes on_end() call export() synchronously
# on the handler thread every time the queue hits 1 item — which is
# every span. That moves the blocking OTLP upload onto the user-facing
# request path, the exact behaviour the background thread is meant to
# prevent.
LANGFUSE_FLUSH_INTERVAL = "1"

# Shared-secret gate: CloudFront injects this header on every origin
# request (terraform-managed, stored in Secrets Manager). The FastAPI
27 changes: 15 additions & 12 deletions services/agent/src/fpl_agent/api.py
@@ -51,9 +51,6 @@
observe,
propagate_attributes,
)
from fpl_lib.observability import (
flush as langfuse_flush,
)
from fpl_lib.secrets import resolve_secret_to_env

logger = logging.getLogger(__name__)
@@ -446,14 +443,19 @@ async def chat_sync(
final_state = await _run_graph(graph, req.question, req.squad)
except Exception as exc: # noqa: BLE001 — surface any agent failure as 500
logger.exception("agent run failed")
langfuse_flush()
raise HTTPException(status_code=500, detail=f"agent_failure: {exc}") from exc

_emit_quality_scores(final_state)
await budget.record_batch(final_state.get("llm_usage", []))
response = _agent_response(final_state).model_dump(mode="json")

langfuse_flush()
# NOTE: no explicit langfuse_flush() here. Langfuse 4.x's
# resource_manager.flush() calls _score_ingestion_queue.join() with no
# timeout, so a slow or unreachable Langfuse endpoint blocks the response
# thread indefinitely. The SDK's background batch processor (flush
# interval 1s) drains spans/scores on its own thread between invocations;
# a small tail of events may be stranded in the queue when the Lambda
# container freezes, which is the accepted trade-off at our scale.
return JSONResponse(response)


@@ -479,7 +481,14 @@ async def chat_stream(
not around ``EventSourceResponse(...)`` — because the generator runs
async after the handler returns, and contextvars don't propagate
across the boundary unless they're entered in the generator frame.
Same reason ``langfuse_flush`` is in the generator's ``finally``.

Note: we do not explicitly flush Langfuse here. ``flush()`` on Langfuse
4.x unconditionally blocks on ``_score_ingestion_queue.join()`` with no
timeout, which can hang the stream indefinitely when Langfuse Cloud is
slow. The background batch processor (``LANGFUSE_FLUSH_INTERVAL=1``)
drains spans and scores between requests on its own thread — at our
scale (1–2 rpm) the worst case is a small tail of events stranded when
the container freezes, which is acceptable.
"""

async def event_generator() -> AsyncIterator[dict[str, str]]:
@@ -505,7 +514,6 @@ async def event_generator() -> AsyncIterator[dict[str, str]]:
except Exception as exc: # noqa: BLE001
logger.exception("agent stream failed")
yield _sse("error", {"message": str(exc)})
langfuse_flush()
return

_emit_quality_scores(final_state)
@@ -516,11 +524,6 @@ async def event_generator() -> AsyncIterator[dict[str, str]]:

yield _sse("result", _agent_response(final_state).model_dump(mode="json"))

# Outside propagate_attributes — flush must happen after the last
# event is yielded so client disconnect between steps still uploads
# accumulated spans.
langfuse_flush()

return EventSourceResponse(event_generator())

