37 commits
055e4e6
Update software-agent-sdk submodule to main
openhands-agent Nov 7, 2025
a6ec978
initial commit, eval for code search
openhands-agent Nov 7, 2025
36fa267
Num runs should be managed by the user externally
adityasoni9998 Nov 7, 2025
7d3d360
Update software-agent-sdk submodule to main
adityasoni9998 Nov 7, 2025
5bf46dd
docker works
adityasoni9998 Nov 7, 2025
1fc3cac
example config for qwen3
adityasoni9998 Nov 7, 2025
5f74f63
local runtime works
adityasoni9998 Nov 7, 2025
5e2820d
use host network in agent sdk
adityasoni9998 Nov 9, 2025
bfe182a
add eval
adityasoni9998 Nov 10, 2025
72ef6ff
add eval
adityasoni9998 Nov 10, 2025
b891149
add analysis code
adityasoni9998 Nov 10, 2025
479c081
module-level rewards
adityasoni9998 Dec 4, 2025
86957d8
fine-grained rewards eval
adityasoni9998 Dec 8, 2025
fe75fb2
fine-grained rewards
adityasoni9998 Dec 8, 2025
64bb3ee
docker doesn't work but local does
adityasoni9998 Dec 8, 2025
db8e7bb
update README
adityasoni9998 Dec 8, 2025
6b92366
Merge branch 'main' into agentic_code_search
adityasoni9998 Dec 22, 2025
6d52715
revert to only allow local workspace in agentic code search
adityasoni9998 Dec 22, 2025
76b4a01
minor code bug fix
adityasoni9998 Dec 22, 2025
dea232c
Merge branch 'main' into agentic_code_search
adityasoni9998 Dec 29, 2025
a417dc6
Update software-agent-sdk submodule to match trainer
adityasoni9998 Dec 29, 2025
11ea94e
update parser config
adityasoni9998 Dec 29, 2025
7730bac
add dataset
adityasoni9998 Dec 29, 2025
160f527
Merge main into agentic_code_search and fix CI issues
openhands-agent Jan 8, 2026
39b8e0a
update agent-sdk
adityasoni9998 Jan 25, 2026
67dcc25
working checkpoint
adityasoni9998 Feb 23, 2026
9c1202e
prompt cleanup
adityasoni9998 Feb 23, 2026
65357c9
update eval code
adityasoni9998 Feb 23, 2026
a086441
cleanup code
adityasoni9998 Feb 23, 2026
a7198ae
rollout logic
adityasoni9998 Feb 23, 2026
482a100
add reminder logic to run_infer.py
adityasoni9998 Feb 23, 2026
c4eeee1
fix regression -- detect conversation ending properly
adityasoni9998 Feb 23, 2026
b449f20
polish metric computation
adityasoni9998 Feb 24, 2026
26fbbb4
minor update
adityasoni9998 Mar 18, 2026
6262db1
minor update
adityasoni9998 Mar 18, 2026
41717ea
Revise README with upcoming details notice
adityasoni9998 Mar 19, 2026
7cf83b8
Revise README for CodeScout evaluation setup
adityasoni9998 Mar 19, 2026
5 changes: 5 additions & 0 deletions benchmarks/agentic_code_search/README.md
@@ -0,0 +1,5 @@
## Agentic Code Search

Benchmarking code that evaluates how well LLMs can localize the code in a Python repository that must be edited to resolve an issue described in natural language.

- NOTE: The JSONL file for the ground truth is prepared using [this code](https://github.com/adityasoni9998/LocAgent/blob/master/util/benchmark/gen_oracle_locations.py).
Empty file.
44 changes: 44 additions & 0 deletions benchmarks/agentic_code_search/eval_infer.py
@@ -0,0 +1,44 @@
import json
from argparse import ArgumentParser


def main(args):
    results_file = args.results_file
    f1_file = 0
    f1_function = 0
    f1_module = 0
    num_steps = 0
    num_tool_calls = 0
    total_time = 0
    cnt = 0
    with open(results_file, "r") as f:
        for line in f:
            result = json.loads(line)
            test_result = result["test_result"]
            if "num_steps" in test_result:
                num_steps += test_result["num_steps"]
            if "num_tool_calls" in test_result:
                num_tool_calls += test_result["num_tool_calls"]
            if "wall_time_seconds" in test_result:
                total_time += test_result["wall_time_seconds"]

            reward_dict = result["test_result"]["reward"]
            cnt += 1
            if reward_dict is not None:
                f1_file += reward_dict.get("file_reward", 0)
                f1_module += reward_dict.get("module_reward", 0)
                f1_function += reward_dict.get("entity_reward", 0)

    print(f"Average File F1 score: {f1_file / cnt:.4f} over {cnt} samples")
    print(f"Average Module F1 score: {f1_module / cnt:.4f} over {cnt} samples")
    print(f"Average Function F1 score: {f1_function / cnt:.4f} over {cnt} samples")
    print(f"Average # of steps: {num_steps / cnt:.4f} over {cnt} samples")
    print(f"Average # of tool calls: {num_tool_calls / cnt:.4f} over {cnt} samples")
    print(f"Average wall time (s): {total_time / cnt:.4f} over {cnt} samples")


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--results_file", type=str, required=True)
    args = parser.parse_args()
    main(args)
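Review note: `main` divides every total by `cnt`, so an empty results file would raise `ZeroDivisionError`, and a sample whose `reward` is `None` still counts in the denominator with a score of 0. A minimal sketch of the same aggregation with an empty-input guard could look like the following. The helper name `average_metrics` and the sample records are hypothetical; the keys are the ones `eval_infer.py` reads.

```python
import json


def average_metrics(lines):
    """Average per-sample metrics from JSONL lines (hypothetical helper)."""
    reward_keys = ("file_reward", "module_reward", "entity_reward")
    stat_keys = ("num_steps", "num_tool_calls", "wall_time_seconds")
    totals = {key: 0.0 for key in reward_keys + stat_keys}
    cnt = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines instead of failing in json.loads
        test_result = json.loads(line)["test_result"]
        cnt += 1
        for key in stat_keys:
            totals[key] += test_result.get(key, 0)
        # A missing or null reward adds 0 but still counts as a sample.
        reward = test_result.get("reward") or {}
        for key in reward_keys:
            totals[key] += reward.get(key, 0)
    if cnt == 0:
        return None  # nothing to average; avoids dividing by zero
    return {key: total / cnt for key, total in totals.items()}


# Made-up records for illustration only.
sample_lines = [
    json.dumps(
        {
            "test_result": {
                "num_steps": 4,
                "reward": {
                    "file_reward": 1.0,
                    "module_reward": 0.5,
                    "entity_reward": 0.5,
                },
            }
        }
    ),
    json.dumps({"test_result": {"num_steps": 6, "reward": None}}),
]
averages = average_metrics(sample_lines)
print(averages)
```

Returning `None` for empty input leaves it to the caller to decide whether to print a message or exit with an error.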
30 changes: 30 additions & 0 deletions benchmarks/agentic_code_search/prompts/file_module.j2
@@ -0,0 +1,30 @@
I have access to a Python code repository in the directory {{ working_dir }} . Consider the following issue description:

<issue_description>
{{ problem_statement }}
</issue_description>

Act as a code search agent and localize the specific files, classes or functions of code that need modification to resolve the issue in <issue_description>.

NOTE: You do not need to solve the issue; you only need to localize the relevant code in the repository. Your output will be used to guide another agent to solve the issue.

Your final output should list the locations requiring modification, wrapped with triple backticks ```
Each location should include the file path, class name (if applicable), and function name. Here is an example output:
```
full_path1/file1.py
class: MyClass1
function: my_function1

full_path2/file2.py
function: MyClass2.my_function2

full_path3/file3.py
function: my_function3
```

IMPORTANT: Your output MUST follow the rules below:
1. The final output must be returned in the message parameter of the Finish tool wrapped within ```, and there should be NO text outside these triple backticks (```).
2. The locations of the file path must be RELATIVE to the {{ working_dir }} directory WITHOUT any leading "./" in the output.
3. For each localized code output, you MUST always include the file path and the function name. If the function is within a class you MUST also include the class name.
4. Only include those locations in your output that need modification to resolve the issue in <issue_description>. Do NOT include any locations that do not need modification.
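Review note: the output format this prompt requests can be parsed line by line. The sketch below assumes the final message contains exactly one triple-backtick block laid out like the example in the prompt, with no language tag after the opening backticks. The names `parse_locations` and `set_f1` are hypothetical, and the set-based F1 is only an assumption about how file-level rewards might be scored; this is not the parsing or reward code used by the benchmark.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests inside a fence


def parse_locations(message):
    """Parse a fenced location block into (file, class, function) tuples."""
    match = re.search(FENCE + "(.*?)" + FENCE, message, flags=re.DOTALL)
    if match is None:
        return []
    locations = []
    current_file = None
    current_class = None
    for raw in match.group(1).splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("class:"):
            current_class = line[len("class:"):].strip()
        elif line.startswith("function:"):
            name = line[len("function:"):].strip()
            cls = current_class
            # "MyClass.my_function" carries the class name inline.
            if "." in name:
                cls, name = name.rsplit(".", 1)
            if current_file is not None:
                locations.append((current_file, cls, name))
        else:
            # Any other non-empty line is treated as a file path.
            current_file = line
            current_class = None
    return locations


def set_f1(predicted, gold):
    """F1 between two collections treated as sets; 0.0 when nothing overlaps."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


message = "\n".join(
    [
        FENCE,
        "full_path1/file1.py",
        "class: MyClass1",
        "function: my_function1",
        "",
        "full_path2/file2.py",
        "function: MyClass2.my_function2",
        "",
        "full_path3/file3.py",
        "function: my_function3",
        FENCE,
    ]
)
locations = parse_locations(message)
predicted_files = {file_path for file_path, _, _ in locations}
# Made-up gold set for illustration only.
gold_files = {"full_path1/file1.py", "full_path4/file4.py"}
file_f1 = set_f1(predicted_files, gold_files)
print(locations)
print(file_f1)
```

Two limitations of this sketch: a `class:` line with no following `function:` line produces no entry, and a `class:` line stays in effect for every later `function:` line in the same file.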
