
fix(agent): Langfuse 4.x flush — correct env vars + remove explicit flush #138

Merged
ikuzuki merged 1 commit into main from fix/langfuse-timeout-correction on Apr 20, 2026

Conversation


@ikuzuki ikuzuki commented Apr 20, 2026

Summary

Corrects #137's Langfuse re-enablement, which after deploy caused Task timed out after 60 seconds on every /team request and forced an emergency kill-switch flip. The fix is two env-var corrections plus removing the explicit langfuse_flush() calls from the request path.

Root cause (source-code level, see ADR-0005 revision)

Langfuse 4.x has two separate blocking surfaces on the response thread. #137 addressed neither.

1. OTLPSpanExporter retry loop. Exponential backoff (1+2+4+8+16+32 ≈ 63s). Bounded only by the exporter's timeout= ctor arg, which Langfuse wires to LANGFUSE_TIMEOUT (seconds, default 5). The OTEL_BSP_EXPORT_TIMEOUT and OTEL_EXPORTER_OTLP_TIMEOUT env vars I set in #137 are documented-ignored by OpenTelemetry's OTLP HTTP exporter:

# Not used. No way currently to pass timeout to export.
opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py:164
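The backoff arithmetic can be sketched as follows — a minimal model of an exporter retry loop whose only bound is a per-export timeout budget. This is an illustration of the failure mode, not the OTel source; the function name and shape are invented:

```python
import time

def export_with_retries(send, timeout_s: float, max_retries: int = 6) -> bool:
    """Retry send() with 1s, 2s, 4s, ... backoff, never sleeping past the deadline."""
    deadline = time.monotonic() + timeout_s
    backoff = 1.0
    for _ in range(max_retries):
        if send():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # timeout budget exhausted: give up instead of blocking
        time.sleep(min(backoff, remaining))
        backoff *= 2
    return False

# Unbounded worst case: 1+2+4+8+16+32 seconds of backoff across 6 retries
assert sum(2 ** i for i in range(6)) == 63
```

With timeout_s wired from LANGFUSE_TIMEOUT (as in the real exporter ctor), the loop can never sleep past its budget; with only the ignored OTEL_* vars set, nothing caps it.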

2. resource_manager.flush()'s queue joins. Langfuse().flush() calls _score_ingestion_queue.join() with no timeout. /chat emits scores via _emit_quality_scores, so if Langfuse Cloud is slow, .join() blocks the handler thread indefinitely. Not configurable in 4.0.6 (Langfuse PR #1618 ships a span_exporter= kwarg that works around this in 4.1+).
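For reference, the stdlib primitive underneath has no escape hatch: queue.Queue.join() accepts no timeout argument, so a bounded drain must be hand-rolled. The sketch below is a hypothetical workaround — not what 4.0.6 does — polling unfinished_tasks against a deadline:

```python
import inspect
import queue
import time

def bounded_join(q: queue.Queue, timeout_s: float) -> bool:
    """Wait for all queued tasks to be marked done, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while q.unfinished_tasks:
        if time.monotonic() >= deadline:
            return False  # unlike Queue.join(), this returns instead of hanging
        time.sleep(0.01)
    return True

# Queue.join() itself takes no timeout — the root of the unbounded block
assert "timeout" not in inspect.signature(queue.Queue.join).parameters

q = queue.Queue()
q.put("score-event")  # enqueued, but no consumer ever calls task_done()
assert bounded_join(q, timeout_s=0.05) is False
```

This is exactly the situation /chat lands in when _emit_quality_scores has enqueued scores and Langfuse Cloud is slow.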

Bonus footgun: LANGFUSE_FLUSH_AT=1 (which I thought reduced queue pressure) actually sets max_export_batch_size=1 on BatchSpanProcessor, which makes on_end() trigger a synchronous export on the handler thread for every span. That moves the blocking upload onto the user-facing request path — the exact behaviour the background thread exists to prevent.
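A toy model (not the actual OTel BatchSpanProcessor — class and attribute names are invented) of why batch size 1 defeats the purpose:

```python
import threading

class ToyBatchProcessor:
    """Minimal stand-in: exports synchronously whenever the batch fills."""
    def __init__(self, export, max_export_batch_size: int):
        self.export = export
        self.batch_size = max_export_batch_size
        self.buffer = []
        self.export_threads = []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:   # batch full -> export now,
            self.export(self.buffer)              # on whichever thread ended the span
            self.export_threads.append(threading.current_thread().name)
            self.buffer = []

exports = []
p = ToyBatchProcessor(exports.append, max_export_batch_size=1)
p.on_end("span-1")  # with batch size 1, the caller pays for the upload immediately
assert exports == [["span-1"]]
assert p.export_threads == [threading.current_thread().name]
```

With a sane batch size, on_end() just buffers and the dedicated background thread does the exporting; with size 1, every span ends with a blocking upload on the handler thread.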

Fix

lambda.tf

  • Add LANGFUSE_TIMEOUT="2" — caps the retry loop at ~2s per flush.
  • Add LANGFUSE_FLUSH_INTERVAL="1" — background thread drains queue every 1s.
  • Remove OTEL_BSP_EXPORT_TIMEOUT, OTEL_EXPORTER_OTLP_TIMEOUT (ignored).
  • Remove LANGFUSE_FLUSH_AT (counterproductive).
  • Keep LANGFUSE_TRACING_ENABLED="false" — kill-switch stays while we verify the fix. Flip to "true" via CLI after apply; Terraform default promotes only after confirmed.
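For illustration, the resulting environment block would look roughly like this (the surrounding resource layout is an assumption, not copied from the repo's lambda.tf):

```hcl
environment {
  variables = {
    LANGFUSE_TIMEOUT         = "2"     # caps the exporter retry loop at ~2s per flush
    LANGFUSE_FLUSH_INTERVAL  = "1"     # background thread drains the queue every 1s
    LANGFUSE_TRACING_ENABLED = "false" # kill-switch stays off until verified
    # OTEL_BSP_EXPORT_TIMEOUT / OTEL_EXPORTER_OTLP_TIMEOUT removed (ignored by exporter)
    # LANGFUSE_FLUSH_AT removed (forces synchronous per-span export)
  }
}
```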

api.py

  • Remove all three langfuse_flush() calls from the /chat and /chat/sync handlers. The background batch processor drains spans + scores between requests on its own thread; the explicit flush was the call path into the unbounded _score_ingestion_queue.join().
  • Drop the now-unused flush as langfuse_flush import.
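The shape of the api.py change, with handler and type names simplified for illustration (not the actual diff):

```diff
-from langfuse import flush as langfuse_flush
-
 @app.post("/chat")
 async def chat(req: ChatRequest):
     async for chunk in run_agent(req):
         yield chunk
-    langfuse_flush()  # blocked on _score_ingestion_queue.join() with no timeout
+    # no explicit flush: the batch processor's background thread drains
+    # spans + scores between requests on its own schedule
```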

Tests

  • Deleted two tests that asserted the handler calls langfuse_flush() — we deliberately removed that behaviour. A comment where they used to live points to the ADR revision.
  • 150 agent + lib tests pass locally.

Docs

  • ADR-0005 revision — fully rewritten with:
    • The two blocking surfaces
    • What we tried first and why it failed
    • Source-code citations (langfuse/_client/client.py:269, span_processor.py:108-112, resource_manager.py:430, the OTel __init__.py:174-224 retry loop)
    • ADOT stays deferred, with span_exporter= in Langfuse 4.1+ as the cleaner escape hatch

Rollout plan

  1. Merge + CI terraform apply — changes the env vars, Langfuse stays OFF.
  2. Verify /team + /chat still work on the deployed (untraced) Lambda.
  3. CLI-flip LANGFUSE_TRACING_ENABLED=true on the live Lambda.
  4. Hit /team + /chat, measure latency. Should be <3s even if Langfuse is slow.
  5. If good → follow-up PR promotes LANGFUSE_TRACING_ENABLED="true" in Terraform.
  6. If bad → CLI-flip back to false (same kill-switch), file a follow-up.

Why this time will work (confidence check)

The research for #137 was docs-level and got the wrong answer. This time the research was source-level — the agent traced the exact call chain in the actually installed langfuse==4.0.6 and OpenTelemetry SDK. The OTEL exporter source literally has a comment saying the timeout env var is ignored. LANGFUSE_TIMEOUT=2 is wired straight into the exporter ctor — verifiable by reading the SDK. The unbounded .join() path on _score_ingestion_queue is also visible in the SDK source.

The remaining risk is the background batch processor dropping spans when the Lambda freezes mid-batch. That's the accepted trade-off at our scale (1-2 rpm); if the Langfuse UI shows visible drop rate, ADOT is the next step.

Test plan

  • 150 agent + lib tests pass locally
  • ruff + terraform validate clean
  • CI green
  • After deploy + CLI flip: /team returns in <3s even with tracing on
  • After deploy + CLI flip: a trace appears in the Langfuse UI for each request
  • After deploy + CLI flip: /chat streams end-to-end without handler hang

🤖 Generated with Claude Code

…emove explicit flush

#137 re-enabled Langfuse with OTEL_BSP_EXPORT_TIMEOUT +
OTEL_EXPORTER_OTLP_TIMEOUT, which are documented-ignored by the OTLP
HTTP exporter (# Not used. No way currently to pass timeout to export.)
and by Langfuse's own LangfuseSpanProcessor. Every /team request
post-deploy hit Lambda's 60s timeout, forcing an emergency CLI flip of
LANGFUSE_TRACING_ENABLED=false.

Deep source-code research on langfuse-python v4.0.6 identified two
separate blocking surfaces:

1. OTLPSpanExporter retry loop — 6 retries with exponential backoff
   (~63s total). Bounded only by the exporter's own `timeout=` ctor
   arg, which Langfuse wires to the LANGFUSE_TIMEOUT env var
   (seconds, default 5). That's the knob, not the OTEL_* ones.
2. resource_manager.flush() → _score_ingestion_queue.join() — no
   timeout. If we emit scores (/chat does, via _emit_quality_scores)
   and Langfuse is slow, .join() blocks indefinitely on the handler
   thread. This is unfixable without the SDK's upcoming v4.1
   span_exporter= kwarg.

Additionally, LANGFUSE_FLUSH_AT=1 was counterproductive: it sets
max_export_batch_size=1, making BatchSpanProcessor.on_end() trigger
synchronous export on the handler thread for every span — moving the
blocking HTTP upload onto the user-facing request path, which is what
the background thread exists to prevent.

Fix:
- lambda.tf: LANGFUSE_TIMEOUT=2 (caps retry loop);
  LANGFUSE_FLUSH_INTERVAL=1 (frequent background flush); remove
  OTEL_BSP_EXPORT_TIMEOUT, OTEL_EXPORTER_OTLP_TIMEOUT,
  LANGFUSE_FLUSH_AT. LANGFUSE_TRACING_ENABLED stays "false" in
  Terraform — flip to true via CLI after apply once production latency
  is verified.
- api.py: remove three explicit langfuse_flush() calls from /chat and
  /chat/sync handlers. The background batch processor drains spans +
  scores on its own thread; explicit flush was the path through the
  unbounded _score_ingestion_queue.join().

Tests: two tests that mocked langfuse_flush() were deleted — they
asserted behaviour we deliberately removed. 150 tests pass. ruff,
terraform validate clean.

Docs: ADR-0005 revision expanded with source-level citations (SDK
line numbers) so the next person who touches this doesn't redo the
research.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ikuzuki ikuzuki merged commit 59ad906 into main Apr 20, 2026
9 checks passed
@ikuzuki ikuzuki deleted the fix/langfuse-timeout-correction branch April 20, 2026 22:09
ikuzuki added a commit that referenced this pull request Apr 21, 2026
Three rounds of env-var tuning (#133 / #137 / #138) failed to cap the
request-path hang at <60s. Test on 2026-04-21 with LANGFUSE_TIMEOUT=2
confirmed live still hit Lambda's 60s timeout on /team; the actual
blocking call was not root-caused.

- Drop LANGFUSE_TIMEOUT and LANGFUSE_FLUSH_INTERVAL from lambda.tf —
  both are no-ops when tracing is disabled, keeping them was cargo.
- Rewrite the comment next to LANGFUSE_TRACING_ENABLED="false" to
  reflect the parked decision rather than the stale "flip via CLI
  after apply" plan.

Enrichment services retain tracing; they run in normal Lambda (no LWA,
no streaming) and don't show this class of hang. Re-entry path is
either a local reproduction with a debugger attached or switching to
the ADOT Lambda Extension — not more env-var guessing.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
