fix(openai): capture trailing streamed usage chunk after finish_reason #780
valentin-ib wants to merge 2 commits into
For context, I identified this whilst setting up a token and cost usage dashboard: the usage wasn't coming through (it was always 0). I wanted to raise it and check whether anyone else thinks this is a reasonable amendment.
Force-pushed from 71c59b1 to 87e3551
Some OpenAI-compatible providers — notably Azure OpenAI and gateways like
LiteLLM — stream token usage in a SEPARATE chunk that arrives AFTER the
finish_reason chunk and just before [DONE], when stream_options.include_usage
is set:
    data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}]}
    data: {"choices":[{"index":0,"delta":{}}],"usage":{...}}
    data: [DONE]
default_decode_stream_event flags the finish_reason chunk terminal?: true,
which finalizes the stream there. A consumer that reads Response.usage right
after the stream "completes" then races the still-in-flight usage chunk, so
input/output tokens (and any cost derived from them) come back as zero.
Non-streaming responses carry usage in the single body and are unaffected.
ChatAPI.decode_stream_event/2 strips the terminal flag off normal-completion
finish_reason chunks so the stream finalizes on [DONE] (or connection close)
instead — by which point the usage chunk has been accumulated. Inline error
chunks (finish_reason: :error, or any chunk carrying an :error key) keep their
terminal flag so failures still surface immediately. [DONE] and empty-choices
usage chunks have no :finish_reason key and are untouched.
Adds regression tests for the chunk ordering, the combined finish_reason+usage
chunk, empty-choices usage, and error-chunk termination.
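The ordering problem described in this commit message can be reproduced without any provider. The sketch below is only an illustration: the module name, the plain-map chunk shape, and the token counts (12 and 34) are all made up, and it does not use ReqLLM's real chunk structs or StreamServer. It folds decoded chunks into accumulated metadata and stops at the first chunk flagged terminal, which is roughly what a finalizing consumer does.

```elixir
defmodule UsageOrderingSketch do
  # Hypothetical decoded chunks, in the order the SSE events arrive.
  # `finish_terminal?` controls whether the finish_reason chunk is
  # flagged terminal (old behavior) or not (behavior after the fix).
  def chunks(finish_terminal?) do
    [
      %{finish_reason: :stop, terminal?: finish_terminal?},
      %{usage: %{input_tokens: 12, output_tokens: 34}},
      %{done: true, terminal?: true}
    ]
  end

  # Merge each chunk into the accumulated metadata and halt at the
  # first chunk flagged terminal. Usage starts at zero, so anything
  # arriving after the halt is never seen.
  def finalize(chunks) do
    initial = %{usage: %{input_tokens: 0, output_tokens: 0}}

    Enum.reduce_while(chunks, initial, fn chunk, acc ->
      merged = Map.merge(acc, Map.delete(chunk, :terminal?))

      if Map.get(chunk, :terminal?, false) do
        {:halt, merged}
      else
        {:cont, merged}
      end
    end)
  end
end

before_fix = UsageOrderingSketch.finalize(UsageOrderingSketch.chunks(true))
after_fix = UsageOrderingSketch.finalize(UsageOrderingSketch.chunks(false))

IO.inspect(before_fix.usage, label: "finish_reason terminal")
IO.inspect(after_fix.usage, label: "[DONE] terminal")
```

With the finish_reason chunk terminal, the fold halts before the usage chunk and the zeros survive; with only [DONE] terminal, the usage chunk is merged first.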
Force-pushed from 87e3551 to 0d340dc
Thanks for the detailed repro. I dug into this against the latest PR head: after rebasing, the decoder-level claim in the original description no longer matches current
I removed the redundant
Could you please test the latest PR branch against your original Azure OpenAI / LiteLLM setup or token/cost dashboard repro? The simulated SSE path is covered and CI is green, but I want confirmation that the real provider path that produced zero usage is fixed before we treat #781 as fully closed.
Description
With stream_options: {include_usage: true}, Azure OpenAI and OpenAI-compatible gateways (e.g. LiteLLM) send token usage in a separate SSE chunk that arrives after the finish_reason chunk, just before [DONE].
default_decode_stream_event/2 flags the finish_reason chunk terminal?: true, so StreamServer finalizes the stream there. A consumer that reads ReqLLM.Response.usage/1 as soon as the stream "completes" then races the still-in-flight usage chunk and gets input_tokens: 0 / output_tokens: 0 (and therefore no cost). Non-streaming responses are unaffected because usage is in the single body. In practice this is effectively deterministic: the usage chunk is a separate network frame, so the read almost always wins the race.
Fix
ReqLLM.Providers.OpenAI.ChatAPI.decode_stream_event/2 strips the terminal? flag off normal-completion finish_reason chunks, so the stream finalizes on [DONE] (or connection close) instead, by which point the trailing usage chunk has been accumulated into metadata.
Guards keep existing behavior intact:
- finish_reason: :error (and any meta carrying an :error key, which is how OpenAI-compatible gateways report mid-stream failures via data: {"error": ...}) keep terminal?, so failures still surface immediately instead of waiting for a [DONE] that won't come.
- [DONE] and empty-choices usage chunks have no :finish_reason key, so they're untouched and keep their own terminal flag.
Scoped to the OpenAI ChatAPI driver; non-streaming and other providers are unaffected.
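The guard order above can be sketched with pattern-matched function clauses. This is a hypothetical stand-in, not the code in this PR: the module and function names are invented, and the metadata is assumed to be a plain map with :finish_reason, :error and :terminal? keys rather than ReqLLM's actual chunk struct.

```elixir
defmodule TerminalGuardSketch do
  # Hypothetical stand-in for the decoder guard; the metadata shape
  # (a plain map) is an assumption made for illustration only.

  # finish_reason: :error stays terminal so failures surface at once.
  def relax_terminal(%{finish_reason: :error} = meta), do: meta

  # Any metadata carrying an :error key also stays terminal.
  def relax_terminal(%{error: _} = meta), do: meta

  # Normal completion: drop the flag so the stream ends on [DONE].
  def relax_terminal(%{finish_reason: _reason} = meta) do
    Map.delete(meta, :terminal?)
  end

  # No :finish_reason key ([DONE], empty-choices usage): untouched.
  def relax_terminal(meta), do: meta
end

stop = TerminalGuardSketch.relax_terminal(%{finish_reason: :stop, terminal?: true})
failed = TerminalGuardSketch.relax_terminal(%{finish_reason: :error, terminal?: true})

IO.inspect(stop, label: "normal completion")
IO.inspect(failed, label: "error completion")
```

Clause order matters here: both error clauses come before the general finish_reason clause, so a chunk that has a finish_reason and an :error key keeps its terminal flag.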
Alternative considered
The deeper root cause is that StreamServer snapshots metadata at the terminal chunk, before the trailing usage chunk merges. A fix there (defer finalization/metadata until the stream truly ends) would be more general but a larger change to the finalization lifecycle. This PR is the minimal, decoder-local fix; happy to take the broader approach instead if you'd prefer.
Type of Change
Breaking Changes
None. The public API and the Response struct are unchanged; streamed Response.usage is simply populated where it was previously zero.
Testing
mix test)
mix format / compile clean; CI runs full mix quality)
New unit tests in test/providers/openai_chat_streaming_usage_test.exs (pure decode tests; no live API / fixtures needed):
- finish_reason chunk is no longer terminal
- finish_reason+usage chunk still yields usage
- choices usage chunk keeps its terminal flag
- [DONE] stays terminal
Checklist
CHANGELOG.md (auto-generated by git_ops)
Related Issues
Closes #781