Skip to content

fix(sync): correct Phoenix span extraction for multi-span traces and tool-calling agents#273

Open
evduester wants to merge 1 commit into
mainfrom
fix/phoenix-span-extraction
Open

fix(sync): correct Phoenix span extraction for multi-span traces and tool-calling agents#273
evduester wants to merge 1 commit into
mainfrom
fix/phoenix-span-extraction

Conversation

@evduester

@evduester evduester commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Fix 1 — trace-level dedup: Each agent run emits one Phoenix span per LLM call, with every subsequent span re-including all prior messages (cumulative context window). Processing all spans caused the same conversation steps to be analysed multiple times, inflating guideline counts. Now deduplicates at trace level and picks a single representative span per trace — the last by start_time, which has the most complete message history.

  • Fix 2 — complete message and tool extraction: The Phoenix REST API returns OpenInference attributes as flat indexed keys (llm.input_messages.{i}.message.*, llm.tools.{i}.tool.json_schema) rather than nested lists. The old extraction missed tool_call_id on tool messages and tool_calls on assistant messages.

    Before the fix, a tool-calling agent trajectory looked like this — the assistant call and tool results are completely unlinked:

    {"role": "assistant", "content": "None"},
    {"role": "tool", "content": "20"},
    {"role": "tool", "content": "25"}

    After the fix, the trajectory is complete and valid:

    {"role": "assistant", "tool_calls": [{"id": "call_abc", "type": "function", "function": {"name": "multiply", "arguments": "{\"a\": 10, \"b\": 2}"}}]},
    {"role": "tool", "tool_call_id": "call_abc", "content": "20"},
    {"role": "tool", "tool_call_id": "call_def", "content": "25"}

    Guidelines generated from the broken trajectories lacked tool-use context entirely. Fixed by adding an indexed-attribute reader for llm.input_messages.{i}.* / llm.output_messages.{i}.* (including nested tool_calls.{j}.*), _extract_tools_from_span to parse llm.tools.{i}.tool.json_schema, and _convert_openinference_tool_calls to convert OpenInference tool_call dicts to OpenAI format.

Test plan

  • uv run pytest tests/unit/test_phoenix_sync.py -v — all 46 tests pass
  • Run evolve sync phoenix against a Phoenix project with a tool-calling agent and verify the trajectory debug file shows complete tool_calls / tool_call_id fields and correct tool definitions
  • Verify skipped count in sync output reflects unique traces, not individual spans

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

Bug Fixes

  • Improved trajectory processing and deduplication for enhanced efficiency.
  • Enhanced tool call extraction with support for multiple conventions.
  • Better message extraction with improved tool metadata handling.

…tool-calling agents

Two independent fixes to `PhoenixSync` that affect guideline generation accuracy.

Fix 1 — trace-level dedup and representative span selection:
Each agent run emits one Phoenix span per LLM call, with every subsequent span
re-including all prior messages (cumulative context window). Processing all spans
caused the same conversation steps to be analysed multiple times with growing
context, inflating guideline counts and skewing results. The fix deduplicates at
trace level (`_get_processed_trace_ids`) and picks a single representative span
per trace — the last by `start_time`, which holds the most complete message
history (`_select_representative_spans`).

Fix 2 — complete message and tool extraction from OpenInference spans:
The Phoenix REST API returns OpenInference attributes as flat indexed keys
(`llm.input_messages.{i}.message.*`, `llm.tools.{i}.tool.json_schema`) rather
than nested lists. The previous extraction code missed `tool_call_id` on tool
messages and `tool_calls` on assistant messages, producing incomplete trajectories
where tool-calling steps appeared as `content: "None"` with no linkage between
assistant calls and tool results. Guidelines generated from such trajectories
lacked tool-use context. The fix adds:
- `tool_call_id` extraction in both message loops
- an indexed-attribute reader for `llm.input_messages.{i}.*` and
  `llm.output_messages.{i}.*` (including nested `tool_calls.{j}.*`)
- `_extract_tools_from_span`: parses `llm.tools.{i}.tool.json_schema` and the
  list-mode `{"tool.json_schema": "..."}` format into OpenAI tool dicts
- `_convert_openinference_tool_calls`: converts OpenInference tool_call dicts to
  OpenAI format (`tool_call.function.name` → `function.name`, etc.)
- `_extract_trajectory`: propagates `tool_calls` and `tool_call_id` from the
  extracted message dict, choosing the right assembly path per message format

Unit tests updated: entity metadata mocks now include `trace_id` to match the
new trace-level dedup logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

PhoenixSync switches from span-level to trace-level deduplication, adding _get_processed_trace_ids() and _select_representative_spans(). Tool handling is expanded with _extract_tools_from_span() and _convert_openinference_tool_calls(), and tool_call_id/raw_tool_calls are propagated through message extraction and trajectory building. Tests are updated to include trace_id in processed-span mocks.

Changes

PhoenixSync trace deduplication and tool handling

Layer / File(s) Summary
Trace-level deduplication and sync() control flow
altk_evolve/sync/phoenix_sync.py, tests/unit/test_phoenix_sync.py
Adds _get_processed_trace_ids() and _select_representative_spans() to select the latest representative span per trace; reworks sync() to iterate only those spans while tracking skipped trace counts and updates logging. Test mocks updated to include trace_id alongside span_id.
Tool extraction and OpenInference conversion helpers
altk_evolve/sync/phoenix_sync.py
Adds _extract_tools_from_span() to retrieve tool schemas from three attribute formats (invocation params, llm.tools, indexed Phoenix flat keys) and _convert_openinference_tool_calls() to translate OpenInference tool call structures into OpenAI-compatible tool_calls.
Message extraction with tool_call_id and indexed messages
altk_evolve/sync/phoenix_sync.py
Extends prompt and completion message mapping to capture and persist tool_call_id. Adds indexed OpenInference message handling that conditionally attaches tool_calls and tool_call_id to each mapped prompt/completion.
Trajectory extraction: tool messages, assistant tool calls, and tools field
altk_evolve/sync/phoenix_sync.py
Updates _extract_trajectory() to plumb raw_tool_calls and tool_call_id, handle OpenInference tool messages, convert assistant tool calls, refine model name selection, and include extracted tools in the returned trajectory payload.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 Hops along the trace, not just one span,
Collecting tool calls the best that I can.
OpenInference? OpenAI? I translate with glee,
The representative span is chosen by me!
Each trace gets one hop — that's the plan! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the two main fixes: trace-level deduplication for multi-span traces and complete tool/message extraction for tool-calling agents.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/phoenix-span-extraction

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
tests/unit/test_phoenix_sync.py (1)

726-766: ⚖️ Poor tradeoff

Consider adding unit tests for new helper methods.

The existing TestGetProcessedSpanIds class tests _get_processed_span_ids(), but the new analogous _get_processed_trace_ids() method lacks dedicated tests. Similarly, _select_representative_spans(), _extract_tools_from_span(), and _convert_openinference_tool_calls() are only covered indirectly through sync() integration tests.

Adding targeted unit tests would improve coverage for edge cases like:

  • _select_representative_spans() with missing start_time or multiple spans per trace
  • _extract_tools_from_span() for each of the three attribute conventions
  • _convert_openinference_tool_calls() format conversion and passthrough

As per coding guidelines: "All new features need tests."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_phoenix_sync.py` around lines 726 - 766, Add dedicated unit
test classes for the new helper methods that currently lack direct test
coverage. Create a TestGetProcessedTraceIds class similar to the existing
TestGetProcessedSpanIds to test the _get_processed_trace_ids() method with
scenarios for empty results, existing entities, and namespace not found
exceptions. Add a TestSelectRepresentativeSpans class to test edge cases
including missing start_time attributes and multiple spans per trace. Add a
TestExtractToolsFromSpan class to test the method for each of the three
supported attribute conventions. Add a TestConvertOpeninferenceToolCalls class
to test format conversion logic and passthrough behavior. Ensure each test
method follows the existing naming conventions and includes appropriate
assertions and mocking setup.

Source: Coding guidelines

altk_evolve/sync/phoenix_sync.py (1)

529-544: 💤 Low value

Redundant JSON schema parsing in fallback block.

Lines 537-543 attempt to parse llm.tools.{i}.tool.json_schema again, but this same key was already read and parsed (or failed) at lines 521-527. If parsing succeeded, we already continued; if it failed or the key was absent, re-reading yields the same result.

Consider removing this redundant block or clarifying the intent if the fallback is meant to handle a different attribute.

♻️ Proposed fix
                 # Fall back to building from name/description/parameters parts
                 name = attrs.get(f"llm.tools.{i}.tool.name")
                 if not name:
                     continue
                 tool: dict = {"type": "function", "function": {"name": name}}
                 description = attrs.get(f"llm.tools.{i}.tool.description")
                 if description:
                     tool["function"]["description"] = description
-                json_schema = attrs.get(f"llm.tools.{i}.tool.json_schema")
-                if json_schema:
-                    try:
-                        schema = json.loads(json_schema) if isinstance(json_schema, str) else json_schema
-                        tool["function"]["parameters"] = schema
-                    except (json.JSONDecodeError, Exception):
-                        pass
+                # parameters could come from a separate .parameters key if schema parsing failed above
+                parameters_str = attrs.get(f"llm.tools.{i}.tool.parameters")
+                if parameters_str:
+                    try:
+                        tool["function"]["parameters"] = json.loads(parameters_str) if isinstance(parameters_str, str) else parameters_str
+                    except (json.JSONDecodeError, Exception):
+                        pass
                 tools.append(tool)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@altk_evolve/sync/phoenix_sync.py` around lines 529 - 544, The json_schema
parsing block in the fallback section (lines 537-543) is redundant because the
same llm.tools.{i}.tool.json_schema attribute was already read and parsed
earlier in the code flow at lines 521-527. Since a successful parse would have
already continued to the next iteration, re-attempting to parse the same
attribute in the fallback will only produce the same result. Remove the
redundant json_schema parsing block (the try-except block that reads and parses
json_schema and assigns it to tool["function"]["parameters"]) from the fallback
section to eliminate the duplicate logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@altk_evolve/sync/phoenix_sync.py`:
- Around line 529-544: The json_schema parsing block in the fallback section
(lines 537-543) is redundant because the same llm.tools.{i}.tool.json_schema
attribute was already read and parsed earlier in the code flow at lines 521-527.
Since a successful parse would have already continued to the next iteration,
re-attempting to parse the same attribute in the fallback will only produce the
same result. Remove the redundant json_schema parsing block (the try-except
block that reads and parses json_schema and assigns it to
tool["function"]["parameters"]) from the fallback section to eliminate the
duplicate logic.

In `@tests/unit/test_phoenix_sync.py`:
- Around line 726-766: Add dedicated unit test classes for the new helper
methods that currently lack direct test coverage. Create a
TestGetProcessedTraceIds class similar to the existing TestGetProcessedSpanIds
to test the _get_processed_trace_ids() method with scenarios for empty results,
existing entities, and namespace not found exceptions. Add a
TestSelectRepresentativeSpans class to test edge cases including missing
start_time attributes and multiple spans per trace. Add a
TestExtractToolsFromSpan class to test the method for each of the three
supported attribute conventions. Add a TestConvertOpeninferenceToolCalls class
to test format conversion logic and passthrough behavior. Ensure each test
method follows the existing naming conventions and includes appropriate
assertions and mocking setup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4fe9522f-7f2d-4812-bfe7-f592a2818e09

📥 Commits

Reviewing files that changed from the base of the PR and between 4d5b285 and e7e785f.

📒 Files selected for processing (2)
  • altk_evolve/sync/phoenix_sync.py
  • tests/unit/test_phoenix_sync.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant