
fix(agent): Langfuse 4.x flush — correct env vars + remove explicit flush #138

Merged
ikuzuki merged 1 commit into main from fix/langfuse-timeout-correction on Apr 20, 2026

Conversation


@ikuzuki ikuzuki commented Apr 20, 2026

Summary

Corrects #137's Langfuse re-enablement, which after deploy caused Task timed out after 60 seconds on every /team request and forced an emergency kill-switch flip. The fix is two env-var corrections plus removing the explicit langfuse_flush() calls from the request path.

Root cause (source-code level, see ADR-0005 revision)

Langfuse 4.x has two separate blocking surfaces on the response thread. #137 addressed neither.

1. OTLPSpanExporter retry loop. Exponential backoff (1+2+4+8+16+32 ≈ 63s). Bounded only by the exporter's timeout= ctor arg, which Langfuse wires to LANGFUSE_TIMEOUT (seconds, default 5). The OTEL_BSP_EXPORT_TIMEOUT and OTEL_EXPORTER_OTLP_TIMEOUT env vars I set in #137 are documented-ignored by OpenTelemetry's OTLP HTTP exporter:

# Not used. No way currently to pass timeout to export.
opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py:164
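The backoff arithmetic can be sketched as follows — a minimal model of an exporter retry loop whose only bound is a per-export timeout budget. This is an illustration of the failure mode, not the OTel source; the function name and shape are invented:

```python
import time

def export_with_retries(send, timeout_s: float, max_retries: int = 6) -> bool:
    """Retry send() with 1s, 2s, 4s, ... backoff, never sleeping past the deadline."""
    deadline = time.monotonic() + timeout_s
    backoff = 1.0
    for _ in range(max_retries):
        if send():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # timeout budget exhausted: give up instead of blocking
        time.sleep(min(backoff, remaining))
        backoff *= 2
    return False

# Unbounded worst case: 1+2+4+8+16+32 seconds of backoff across 6 retries
assert sum(2 ** i for i in range(6)) == 63
```

With timeout_s wired from LANGFUSE_TIMEOUT (as in the real exporter ctor), the loop can never sleep past its budget; with only the ignored OTEL_* vars set, nothing caps it.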

2. resource_manager.flush()'s queue joins. Langfuse().flush() calls _score_ingestion_queue.join() with no timeout. /chat emits scores via _emit_quality_scores, so if Langfuse Cloud is slow, .join() blocks the handler thread indefinitely. Not configurable in 4.0.6 (Langfuse PR #1618 ships a span_exporter= kwarg that works around this in 4.1+).
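For reference, the stdlib primitive underneath has no escape hatch: queue.Queue.join() accepts no timeout argument, so a bounded drain must be hand-rolled. The sketch below is a hypothetical workaround — not what 4.0.6 does — polling unfinished_tasks against a deadline:

```python
import inspect
import queue
import time

def bounded_join(q: queue.Queue, timeout_s: float) -> bool:
    """Wait for all queued tasks to be marked done, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while q.unfinished_tasks:
        if time.monotonic() >= deadline:
            return False  # unlike Queue.join(), this returns instead of hanging
        time.sleep(0.01)
    return True

# Queue.join() itself takes no timeout — the root of the unbounded block
assert "timeout" not in inspect.signature(queue.Queue.join).parameters

q = queue.Queue()
q.put("score-event")  # enqueued, but no consumer ever calls task_done()
assert bounded_join(q, timeout_s=0.05) is False
```

This is exactly the situation /chat lands in when _emit_quality_scores has enqueued scores and Langfuse Cloud is slow.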

Bonus footgun: LANGFUSE_FLUSH_AT=1 (which I thought reduced queue pressure) actually sets max_export_batch_size=1 on BatchSpanProcessor, which makes on_end() trigger a synchronous export on the handler thread for every span. That moves the blocking upload onto the user-facing request path — the exact behaviour the background thread exists to prevent.
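A toy model (not the actual OTel BatchSpanProcessor — class and attribute names are invented) of why batch size 1 defeats the purpose:

```python
import threading

class ToyBatchProcessor:
    """Minimal stand-in: exports synchronously whenever the batch fills."""
    def __init__(self, export, max_export_batch_size: int):
        self.export = export
        self.batch_size = max_export_batch_size
        self.buffer = []
        self.export_threads = []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:   # batch full -> export now,
            self.export(self.buffer)              # on whichever thread ended the span
            self.export_threads.append(threading.current_thread().name)
            self.buffer = []

exports = []
p = ToyBatchProcessor(exports.append, max_export_batch_size=1)
p.on_end("span-1")  # with batch size 1, the caller pays for the upload immediately
assert exports == [["span-1"]]
assert p.export_threads == [threading.current_thread().name]
```

With a sane batch size, on_end() just buffers and the dedicated background thread does the exporting; with size 1, every span ends with a blocking upload on the handler thread.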

Fix

lambda.tf

  • Add LANGFUSE_TIMEOUT="2" — caps the retry loop at ~2s per flush.
  • Add LANGFUSE_FLUSH_INTERVAL="1" — background thread drains queue every 1s.
  • Remove OTEL_BSP_EXPORT_TIMEOUT, OTEL_EXPORTER_OTLP_TIMEOUT (ignored).
  • Remove LANGFUSE_FLUSH_AT (counterproductive).
  • Keep LANGFUSE_TRACING_ENABLED="false" — kill-switch stays while we verify the fix. Flip to "true" via CLI after apply; Terraform default promotes only after confirmed.
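For illustration, the resulting environment block would look roughly like this (the surrounding resource layout is an assumption, not copied from the repo's lambda.tf):

```hcl
environment {
  variables = {
    LANGFUSE_TIMEOUT         = "2"     # caps the exporter retry loop at ~2s per flush
    LANGFUSE_FLUSH_INTERVAL  = "1"     # background thread drains the queue every 1s
    LANGFUSE_TRACING_ENABLED = "false" # kill-switch stays off until verified
    # OTEL_BSP_EXPORT_TIMEOUT / OTEL_EXPORTER_OTLP_TIMEOUT removed (ignored by exporter)
    # LANGFUSE_FLUSH_AT removed (forces synchronous per-span export)
  }
}
```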

api.py

  • Remove all three langfuse_flush() calls from the /chat and /chat/sync handlers. The background batch processor drains spans + scores between requests on its own thread; the explicit flush was the call path into the unbounded _score_ingestion_queue.join().
  • Drop the now-unused flush as langfuse_flush import.
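The shape of the api.py change, with handler and type names simplified for illustration (not the actual diff):

```diff
-from langfuse import flush as langfuse_flush
-
 @app.post("/chat")
 async def chat(req: ChatRequest):
     async for chunk in run_agent(req):
         yield chunk
-    langfuse_flush()  # blocked on _score_ingestion_queue.join() with no timeout
+    # no explicit flush: the batch processor's background thread drains
+    # spans + scores between requests on its own schedule
```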

Tests

  • Deleted two tests that asserted the handler calls langfuse_flush() — we deliberately removed that behaviour. A comment where they used to live points to the ADR revision.
  • 150 agent + lib tests pass locally.

Docs

  • ADR-0005 revision — fully rewritten with:
    • The two blocking surfaces
    • What we tried first and why it failed
    • Source-code citations (langfuse/_client/client.py:269, span_processor.py:108-112, resource_manager.py:430, the OTel __init__.py:174-224 retry loop)
    • ADOT stays deferred, with span_exporter= in Langfuse 4.1+ as the cleaner escape hatch

Rollout plan

  1. Merge + CI terraform apply — changes the env vars, Langfuse stays OFF.
  2. Verify /team + /chat still work on the deployed (untraced) Lambda.
  3. CLI-flip LANGFUSE_TRACING_ENABLED=true on the live Lambda.
  4. Hit /team + /chat, measure latency. Should be <3s even if Langfuse is slow.
  5. If good → follow-up PR promotes LANGFUSE_TRACING_ENABLED="true" in Terraform.
  6. If bad → CLI-flip back to false (same kill-switch), file a follow-up.

Why this time will work (confidence check)

The research for #137 was docs-level and got the wrong answer. This time the research was source-level — the agent traced the exact call chain in the actually installed langfuse==4.0.6 and OpenTelemetry SDK. The OTEL exporter source literally has a comment saying the timeout env var is ignored. LANGFUSE_TIMEOUT=2 is wired straight into the exporter ctor — verifiable by reading the SDK. The unbounded .join() path on _score_ingestion_queue is also visible in the SDK source.

The remaining risk is the background batch processor dropping spans when the Lambda freezes mid-batch. That's the accepted trade-off at our scale (1-2 rpm); if the Langfuse UI shows visible drop rate, ADOT is the next step.

Test plan

  • 150 agent + lib tests pass locally
  • ruff + terraform validate clean
  • CI green
  • After deploy + CLI flip: /team returns in <3s even with tracing on
  • After deploy + CLI flip: a trace appears in the Langfuse UI for each request
  • After deploy + CLI flip: /chat streams end-to-end without handler hang

🤖 Generated with Claude Code

…emove explicit flush

#137 re-enabled Langfuse with OTEL_BSP_EXPORT_TIMEOUT +
OTEL_EXPORTER_OTLP_TIMEOUT, which are documented-ignored by the OTLP
HTTP exporter (# Not used. No way currently to pass timeout to export.)
and by Langfuse's own LangfuseSpanProcessor. Every /team request
post-deploy hit Lambda's 60s timeout, forcing an emergency CLI flip of
LANGFUSE_TRACING_ENABLED=false.

Deep source-code research on langfuse-python v4.0.6 identified two
separate blocking surfaces:

1. OTLPSpanExporter retry loop — 6 retries with exponential backoff
   (~63s total). Bounded only by the exporter's own `timeout=` ctor
   arg, which Langfuse wires to the LANGFUSE_TIMEOUT env var
   (seconds, default 5). That's the knob, not the OTEL_* ones.
2. resource_manager.flush() → _score_ingestion_queue.join() — no
   timeout. If we emit scores (/chat does, via _emit_quality_scores)
   and Langfuse is slow, .join() blocks indefinitely on the handler
   thread. This is unfixable without the SDK's upcoming v4.1
   span_exporter= kwarg.

Additionally, LANGFUSE_FLUSH_AT=1 was counterproductive: it sets
max_export_batch_size=1, making BatchSpanProcessor.on_end() trigger
synchronous export on the handler thread for every span — moving the
blocking HTTP upload onto the user-facing request path, which is what
the background thread exists to prevent.

Fix:
- lambda.tf: LANGFUSE_TIMEOUT=2 (caps retry loop);
  LANGFUSE_FLUSH_INTERVAL=1 (frequent background flush); remove
  OTEL_BSP_EXPORT_TIMEOUT, OTEL_EXPORTER_OTLP_TIMEOUT,
  LANGFUSE_FLUSH_AT. LANGFUSE_TRACING_ENABLED stays "false" in
  Terraform — flip to true via CLI after apply once production latency
  is verified.
- api.py: remove three explicit langfuse_flush() calls from /chat and
  /chat/sync handlers. The background batch processor drains spans +
  scores on its own thread; explicit flush was the path through the
  unbounded _score_ingestion_queue.join().

Tests: two tests that mocked langfuse_flush() were deleted — they
asserted behaviour we deliberately removed. 150 tests pass. ruff,
terraform validate clean.

Docs: ADR-0005 revision expanded with source-level citations (SDK
line numbers) so the next person who touches this doesn't redo the
research.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ikuzuki ikuzuki merged commit 59ad906 into main Apr 20, 2026
9 checks passed
@ikuzuki ikuzuki deleted the fix/langfuse-timeout-correction branch April 20, 2026 22:09
ikuzuki added a commit that referenced this pull request Apr 21, 2026
Three rounds of env-var tuning (#133 / #137 / #138) failed to cap the
request-path hang at <60s. Test on 2026-04-21 with LANGFUSE_TIMEOUT=2
confirmed live still hit Lambda's 60s timeout on /team; the actual
blocking call was not root-caused.

- Drop LANGFUSE_TIMEOUT and LANGFUSE_FLUSH_INTERVAL from lambda.tf —
  both are no-ops when tracing is disabled, keeping them was cargo.
- Rewrite the comment next to LANGFUSE_TRACING_ENABLED="false" to
  reflect the parked decision rather than the stale "flip via CLI
  after apply" plan.

Enrichment services retain tracing; they run in normal Lambda (no LWA,
no streaming) and don't show this class of hang. Re-entry path is
either a local reproduction with a debugger attached or switching to
the ADOT Lambda Extension — not more env-var guessing.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
