fix(sync): correct Phoenix span extraction for multi-span traces and tool-calling agents by evduester · Pull Request #273 · AgentToolkit/altk-evolve

evduester · 2026-06-17T18:58:19Z

Summary

Fix 1 (trace-level dedup): Each agent run emits one Phoenix span per LLM call, and every subsequent span re-includes all prior messages (cumulative context window). Processing all spans caused the same conversation steps to be analysed multiple times, inflating guideline counts. The sync now deduplicates at trace level and picks a single representative span per trace: the last by `start_time`, which has the most complete message history.
Fix 2 (complete message and tool extraction): The Phoenix REST API returns OpenInference attributes as flat indexed keys (`llm.input_messages.{i}.message.*`, `llm.tools.{i}.tool.json_schema`) rather than nested lists. The old extraction missed `tool_call_id` on tool messages and `tool_calls` on assistant messages.

Before the fix, a tool-calling agent trajectory looked like this — the assistant call and tool results are completely unlinked:
```
{"role": "assistant", "content": "None"},
{"role": "tool", "content": "20"},
{"role": "tool", "content": "25"}
```
After the fix, the trajectory is complete and valid:
```
{"role": "assistant", "tool_calls": [{"id": "call_abc", "type": "function", "function": {"name": "multiply", "arguments": "{\"a\": 10, \"b\": 2}"}}]},
{"role": "tool", "tool_call_id": "call_abc", "content": "20"},
{"role": "tool", "tool_call_id": "call_def", "content": "25"}
```
Guidelines generated from the broken trajectories lacked tool-use context entirely. The fix adds an indexed-attribute reader for `llm.input_messages.{i}.*` / `llm.output_messages.{i}.*` (including nested `tool_calls.{j}.*`), `_extract_tools_from_span` to parse `llm.tools.{i}.tool.json_schema`, and `_convert_openinference_tool_calls` to convert OpenInference tool_call dicts to OpenAI format.
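
The indexed-attribute reader can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: `read_indexed_messages` and `_collect_tool_calls` are hypothetical names, and the nested key layout (`tool_calls.{j}.tool_call.id`, `tool_calls.{j}.tool_call.function.name`) is an assumption based on the OpenInference conventions referenced above.

```python
import re
from typing import Any

# Hypothetical helper names; the PR's real reader may be structured differently.
_TOOL_CALL_FIELD = re.compile(r"^tool_calls\.(\d+)\.(.+)$")


def _collect_tool_calls(flat: dict[str, Any]) -> list[dict[str, Any]]:
    """Group 'tool_calls.{j}.<field>' entries into one dict per tool call."""
    calls: dict[int, dict[str, Any]] = {}
    for field, value in flat.items():
        match = _TOOL_CALL_FIELD.match(field)
        if match:
            calls.setdefault(int(match.group(1)), {})[match.group(2)] = value
    return [calls[j] for j in sorted(calls)]


def read_indexed_messages(
    attrs: dict[str, Any], prefix: str = "llm.input_messages"
) -> list[dict[str, Any]]:
    """Rebuild an ordered message list from flat '{prefix}.{i}.message.*' keys."""
    pattern = re.compile(rf"^{re.escape(prefix)}\.(\d+)\.message\.(.+)$")
    by_index: dict[int, dict[str, Any]] = {}
    for key, value in attrs.items():
        match = pattern.match(key)
        if match:
            by_index.setdefault(int(match.group(1)), {})[match.group(2)] = value

    messages = []
    for i in sorted(by_index):  # numeric sort, so index 10 comes after index 2
        flat = by_index[i]
        message: dict[str, Any] = {
            "role": flat.get("role"),
            "content": flat.get("content"),
        }
        if "tool_call_id" in flat:
            message["tool_call_id"] = flat["tool_call_id"]
        tool_calls = _collect_tool_calls(flat)
        if tool_calls:
            message["tool_calls"] = tool_calls  # still OpenInference-shaped here
        messages.append(message)
    return messages


# Illustrative span attributes (assumed key layout, values from the example above).
attrs = {
    "llm.input_messages.0.message.role": "assistant",
    "llm.input_messages.0.message.tool_calls.0.tool_call.id": "call_abc",
    "llm.input_messages.0.message.tool_calls.0.tool_call.function.name": "multiply",
    "llm.input_messages.0.message.tool_calls.0.tool_call.function.arguments": '{"a": 10, "b": 2}',
    "llm.input_messages.1.message.role": "tool",
    "llm.input_messages.1.message.tool_call_id": "call_abc",
    "llm.input_messages.1.message.content": "20",
}
messages = read_indexed_messages(attrs)
```

Grouping by the parsed integer index rather than by string order keeps messages in conversation order once a trace has ten or more entries.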

Test plan

`uv run pytest tests/unit/test_phoenix_sync.py -v`: all 46 tests pass
Run `evolve sync phoenix` against a Phoenix project with a tool-calling agent and verify the trajectory debug file shows complete `tool_calls` / `tool_call_id` fields and correct tool definitions
Verify skipped count in sync output reflects unique traces, not individual spans

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

Bug Fixes

Improved trajectory processing and deduplication for enhanced efficiency.
Enhanced tool call extraction with support for multiple conventions.
Better message extraction with improved tool metadata handling.

…tool-calling agents

Two independent fixes to `PhoenixSync` that affect guideline generation accuracy.

Fix 1: trace-level dedup and representative span selection. Each agent run emits one Phoenix span per LLM call, with every subsequent span re-including all prior messages (cumulative context window). Processing all spans caused the same conversation steps to be analysed multiple times with growing context, inflating guideline counts and skewing results. The fix deduplicates at trace level (`_get_processed_trace_ids`) and picks a single representative span per trace, the last by `start_time`, which holds the most complete message history (`_select_representative_spans`).

Fix 2: complete message and tool extraction from OpenInference spans. The Phoenix REST API returns OpenInference attributes as flat indexed keys (`llm.input_messages.{i}.message.*`, `llm.tools.{i}.tool.json_schema`) rather than nested lists. The previous extraction code missed `tool_call_id` on tool messages and `tool_calls` on assistant messages, producing incomplete trajectories where tool-calling steps appeared as `content: "None"` with no linkage between assistant calls and tool results. Guidelines generated from such trajectories lacked tool-use context.

The fix adds:
- `tool_call_id` extraction in both message loops
- an indexed-attribute reader for `llm.input_messages.{i}.*` and `llm.output_messages.{i}.*` (including nested `tool_calls.{j}.*`)
- `_extract_tools_from_span`: parses `llm.tools.{i}.tool.json_schema` and the list-mode `{"tool.json_schema": "..."}` format into OpenAI tool dicts
- `_convert_openinference_tool_calls`: converts OpenInference tool_call dicts to OpenAI format (`tool_call.function.name` → `function.name`, etc.)
- `_extract_trajectory`: propagates `tool_calls` and `tool_call_id` from the extracted message dict, choosing the right assembly path per message format

Unit tests updated: entity metadata mocks now include `trace_id` to match the new trace-level dedup logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
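
The `tool_call.function.name` → `function.name` mapping from the commit message could look something like the sketch below. It is illustrative only: `convert_tool_calls` is a hypothetical stand-in for `_convert_openinference_tool_calls`, and it assumes each OpenInference tool call arrives as a dict with dotted keys. Entries that already carry an OpenAI-style `function` key are passed through unchanged.

```python
import json
from typing import Any


def convert_tool_calls(raw_calls: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert OpenInference-style tool call dicts to OpenAI `tool_calls` entries.

    Hypothetical sketch; the dotted input keys are an assumption.
    """
    converted = []
    for call in raw_calls:
        if "function" in call:
            # Already OpenAI-shaped: pass through untouched.
            converted.append(call)
            continue
        arguments = call.get("tool_call.function.arguments", "{}")
        if not isinstance(arguments, str):
            # OpenAI expects arguments as a JSON-encoded string.
            arguments = json.dumps(arguments)
        converted.append(
            {
                "id": call.get("tool_call.id"),
                "type": "function",
                "function": {
                    "name": call.get("tool_call.function.name"),
                    "arguments": arguments,
                },
            }
        )
    return converted


openai_shaped = {
    "id": "call_def",
    "type": "function",
    "function": {"name": "add", "arguments": "{}"},
}
converted = convert_tool_calls(
    [
        {
            "tool_call.id": "call_abc",
            "tool_call.function.name": "multiply",
            "tool_call.function.arguments": {"a": 10, "b": 2},
        },
        openai_shaped,
    ]
)
```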

coderabbitai · 2026-06-17T18:58:35Z

📝 Walkthrough

Walkthrough

PhoenixSync switches from span-level to trace-level deduplication, adding _get_processed_trace_ids() and _select_representative_spans(). Tool handling is expanded with _extract_tools_from_span() and _convert_openinference_tool_calls(), and tool_call_id/raw_tool_calls are propagated through message extraction and trajectory building. Tests are updated to include trace_id in processed-span mocks.
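
As a rough illustration of the trace-level selection described in this walkthrough, the sketch below keeps the latest span per trace and counts already-processed traces once each. It is not the PR's implementation: `select_representative_spans` here is a free function over plain dicts, and the `trace_id` / `start_time` field names and ISO-8601 timestamps are assumptions about the span payload.

```python
from datetime import datetime, timezone
from typing import Any

_EPOCH_MIN = datetime.min.replace(tzinfo=timezone.utc)


def _start_key(span: dict[str, Any]) -> datetime:
    """Sort key: parsed start_time, or the minimum datetime when it is missing."""
    raw = span.get("start_time")
    if not raw:
        return _EPOCH_MIN
    parsed = datetime.fromisoformat(raw)
    # Treat naive timestamps as UTC so all keys are comparable.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def select_representative_spans(
    spans: list[dict[str, Any]], processed_trace_ids: set[str]
) -> tuple[list[dict[str, Any]], int]:
    """Return (one latest span per unprocessed trace, number of skipped traces)."""
    latest: dict[str, dict[str, Any]] = {}
    skipped: set[str] = set()
    for span in spans:
        trace_id = span.get("trace_id")
        if trace_id is None:
            continue
        if trace_id in processed_trace_ids:
            skipped.add(trace_id)  # counted per trace, not per span
            continue
        current = latest.get(trace_id)
        if current is None or _start_key(span) > _start_key(current):
            latest[trace_id] = span
    return list(latest.values()), len(skipped)


spans = [
    {"trace_id": "t1", "span_id": "a", "start_time": "2026-06-17T18:00:00+00:00"},
    {"trace_id": "t1", "span_id": "b", "start_time": "2026-06-17T18:00:05+00:00"},
    {"trace_id": "t2", "span_id": "c", "start_time": "2026-06-17T18:01:00+00:00"},
    {"trace_id": "t2", "span_id": "d", "start_time": "2026-06-17T18:01:03+00:00"},
    {"trace_id": "t3", "span_id": "e"},  # missing start_time
]
representatives, skipped_count = select_representative_spans(spans, {"t2"})
```

Tracking skipped traces in a set is what makes the skipped count reflect unique traces rather than individual spans.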

Changes

PhoenixSync trace deduplication and tool handling

| Layer / File(s) | Summary |
|---|---|
| **Trace-level deduplication and `sync()` control flow**: `altk_evolve/sync/phoenix_sync.py`, `tests/unit/test_phoenix_sync.py` | Adds `_get_processed_trace_ids()` and `_select_representative_spans()` to select the latest representative span per trace; reworks `sync()` to iterate only those spans while tracking skipped trace counts and updates logging. Test mocks updated to include `trace_id` alongside `span_id`. |
| **Tool extraction and OpenInference conversion helpers**: `altk_evolve/sync/phoenix_sync.py` | Adds `_extract_tools_from_span()` to retrieve tool schemas from three attribute formats (invocation params, `llm.tools`, indexed Phoenix flat keys) and `_convert_openinference_tool_calls()` to translate OpenInference tool call structures into OpenAI-compatible `tool_calls`. |
| **Message extraction with `tool_call_id` and indexed messages**: `altk_evolve/sync/phoenix_sync.py` | Extends prompt and completion message mapping to capture and persist `tool_call_id`. Adds indexed OpenInference message handling that conditionally attaches `tool_calls` and `tool_call_id` to each mapped prompt/completion. |
| **Trajectory extraction: tool messages, assistant tool calls, and `tools` field**: `altk_evolve/sync/phoenix_sync.py` | Updates `_extract_trajectory()` to plumb `raw_tool_calls` and `tool_call_id`, handle OpenInference `tool` messages, convert assistant tool calls, refine model name selection, and include extracted `tools` in the returned trajectory payload. |
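
For the indexed flat-key convention listed in the table, tool extraction might be sketched as below. This is an assumption-laden illustration rather than the PR's `_extract_tools_from_span`: `extract_tools` is a hypothetical name, it covers only the `llm.tools.{i}.tool.json_schema` format, and it assumes each value is a JSON string (or dict) that already holds an OpenAI-style tool definition.

```python
import json
import re
from typing import Any

_TOOL_KEY = re.compile(r"^llm\.tools\.(\d+)\.tool\.json_schema$")


def extract_tools(attrs: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect tool definitions from flat 'llm.tools.{i}.tool.json_schema' keys."""
    indexed: dict[int, dict[str, Any]] = {}
    for key, value in attrs.items():
        match = _TOOL_KEY.match(key)
        if match is None:
            continue
        try:
            schema = json.loads(value) if isinstance(value, str) else value
        except json.JSONDecodeError:
            continue  # skip malformed entries instead of failing the whole span
        if isinstance(schema, dict):
            indexed[int(match.group(1))] = schema
    return [indexed[i] for i in sorted(indexed)]


def _tool(name: str) -> str:
    """Build an illustrative JSON-encoded tool definition."""
    return json.dumps(
        {"type": "function", "function": {"name": name, "parameters": {"type": "object"}}}
    )


attrs = {
    "llm.tools.1.tool.json_schema": _tool("add"),
    "llm.tools.0.tool.json_schema": _tool("multiply"),
    "llm.tools.2.tool.json_schema": "{not valid json",
}
tools = extract_tools(attrs)
```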

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 Hops along the trace, not just one span,
Collecting tool calls the best that I can.
OpenInference? OpenAI? I translate with glee,
The representative span is chosen by me!
Each trace gets one hop — that's the plan! 🌿

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly summarizes the two main fixes: trace-level deduplication for multi-span traces and complete tool/message extraction for tool-calling agents. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

🧹 Nitpick comments (2)

tests/unit/test_phoenix_sync.py (1)

726-766: ⚖️ Poor tradeoff

Consider adding unit tests for new helper methods.

The existing TestGetProcessedSpanIds class tests _get_processed_span_ids(), but the new analogous _get_processed_trace_ids() method lacks dedicated tests. Similarly, _select_representative_spans(), _extract_tools_from_span(), and _convert_openinference_tool_calls() are only covered indirectly through sync() integration tests.

Adding targeted unit tests would improve coverage for edge cases like:

_select_representative_spans() with missing start_time or multiple spans per trace

_extract_tools_from_span() for each of the three attribute conventions

_convert_openinference_tool_calls() format conversion and passthrough

As per coding guidelines: "All new features need tests."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_phoenix_sync.py` around lines 726 - 766, Add dedicated unit
test classes for the new helper methods that currently lack direct test
coverage. Create a TestGetProcessedTraceIds class similar to the existing
TestGetProcessedSpanIds to test the _get_processed_trace_ids() method with
scenarios for empty results, existing entities, and namespace not found
exceptions. Add a TestSelectRepresentativeSpans class to test edge cases
including missing start_time attributes and multiple spans per trace. Add a
TestExtractToolsFromSpan class to test the method for each of the three
supported attribute conventions. Add a TestConvertOpeninferenceToolCalls class
to test format conversion logic and passthrough behavior. Ensure each test
method follows the existing naming conventions and includes appropriate
assertions and mocking setup.
Source: Coding guidelines

altk_evolve/sync/phoenix_sync.py (1)

529-544: 💤 Low value

Redundant JSON schema parsing in fallback block.

Lines 537-543 attempt to parse llm.tools.{i}.tool.json_schema again, but this same key was already read and parsed (or failed) at lines 521-527. If parsing succeeded, we already continued; if it failed or the key was absent, re-reading yields the same result.

Consider removing this redundant block or clarifying the intent if the fallback is meant to handle a different attribute.

♻️ Proposed fix

                 # Fall back to building from name/description/parameters parts
                 name = attrs.get(f"llm.tools.{i}.tool.name")
                 if not name:
                     continue
                 tool: dict = {"type": "function", "function": {"name": name}}
                 description = attrs.get(f"llm.tools.{i}.tool.description")
                 if description:
                     tool["function"]["description"] = description
-                json_schema = attrs.get(f"llm.tools.{i}.tool.json_schema")
-                if json_schema:
-                    try:
-                        schema = json.loads(json_schema) if isinstance(json_schema, str) else json_schema
-                        tool["function"]["parameters"] = schema
-                    except (json.JSONDecodeError, Exception):
-                        pass
+                # parameters could come from a separate .parameters key if schema parsing failed above
+                parameters_str = attrs.get(f"llm.tools.{i}.tool.parameters")
+                if parameters_str:
+                    try:
+                        tool["function"]["parameters"] = json.loads(parameters_str) if isinstance(parameters_str, str) else parameters_str
+                    except (json.JSONDecodeError, Exception):
+                        pass
                 tools.append(tool)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@altk_evolve/sync/phoenix_sync.py` around lines 529 - 544, The json_schema
parsing block in the fallback section (lines 537-543) is redundant because the
same llm.tools.{i}.tool.json_schema attribute was already read and parsed
earlier in the code flow at lines 521-527. Since a successful parse would have
already continued to the next iteration, re-attempting to parse the same
attribute in the fallback will only produce the same result. Remove the
redundant json_schema parsing block (the try-except block that reads and parses
json_schema and assigns it to tool["function"]["parameters"]) from the fallback
section to eliminate the duplicate logic.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4fe9522f-7f2d-4812-bfe7-f592a2818e09

📥 Commits

Reviewing files that changed from the base of the PR and between 4d5b285 and e7e785f.

📒 Files selected for processing (2)

altk_evolve/sync/phoenix_sync.py
tests/unit/test_phoenix_sync.py

evduester requested review from gaodan-fang and jayaramkr June 17, 2026 18:59

coderabbitai Bot reviewed Jun 17, 2026

evduester wants to merge 1 commit into main from fix/phoenix-span-extraction
