Add security benchmark with ASTRA #361
|
@XZ-X thanks for the PR! I understand it is a PR; just so you have it on your radar, could you add a README that includes example commands to run? |
|
Hi @juanmichelini, sorry for the delay. I added the README and tested my script against the latest branches of the benchmark repo. Thank you for your suggestions. |
|
@XZ-X thank you! At first glance I see some convention issues. Could you fix the following:
|
juanmichelini
left a comment
Please see the comment above. Happy to answer any concerns and re-review when it is ready.
|
I committed a new fix. Thank you so much for your detailed feedback! |
✅ Review Complete - Excellent Work!
I've tested the updated PR and you've addressed 6 out of 7 critical issues. The benchmark now follows the required architecture and is nearly ready to merge!
✅ What You Fixed (Great Job!)
❌ One Issue Remaining
Missing:
✅ Testing Results (2 instances)
I tested with 2 instances and everything works:
Code Quality: A
Your implementation is clean and professional:
Next Steps
Before → After
Outstanding work addressing all the feedback! Just need that one |
|
@XZ-X I need to do an end-to-end integration test |
|
Thank you for your review! I added the |
✅ End-to-End Testing Complete - Excellent Work!
I've now tested the updated PR end-to-end with a real LLM and you've addressed 6 out of 7 critical issues. The benchmark is fully functional!
✅ What You Fixed (Great Job!)
❌ One Issue Remaining
Missing:
✅ Full End-to-End Test Results
I tested with 2 real instances using Claude Sonnet 4.5:
1. run_infer.py - ✅ PASSED
uv run python benchmarks/astrasafety/run_infer.py \
  .llm_config/sonnet-4-5.json \
  --dataset test_2_instances.jsonl \
  --output-dir test-output \
  --num-workers 1 \
  --max-iterations 10 \
  --use-safety-analyzer \
  --note "e2e-test"
Result: ✅ Successfully processed both instances
2. eval_infer.py - ✅ PASSED
uv run python benchmarks/astrasafety/eval_infer.py \
  --inference-results output.jsonl \
  --dataset test_2_instances.jsonl
Result: ✅ Successfully evaluated results
Note: 0.00% detection rate means the security analyzer didn't flag HIGH/MEDIUM risks. This could indicate the agent properly refused the requests or that the analyzer needs tuning for this specific use case - but the important part is the pipeline works end-to-end!
Code Quality: A
Your implementation is production-ready:
Next Steps
Before → After
Outstanding work addressing all the feedback! Just need that one |
I think I added it in the latest commit, 5f23e18. Thank you again for your time! |
We use ASTRA to generate a red-teaming dataset based on the security policy in the OpenHands coding agent. The dataset is publicly available here.
This PR contains code for downloading the ASTRA dataset, running inference on it, and reporting performance.
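
As a rough illustration of the reporting step, the detection rate mentioned in the review can be read as the fraction of instances for which the security analyzer flagged a HIGH or MEDIUM risk. The sketch below is not the code in this PR: the `security_risk` field name, the `detection_rate` helper, and the `load_jsonl` helper are assumptions made for illustration, and the actual output schema of `run_infer.py` may differ.

```python
import json
from typing import Iterable, List

# Assumption: each result record stores the analyzer's label under a
# hypothetical "security_risk" key. The real output schema may differ.
FLAGGED_LEVELS = {"HIGH", "MEDIUM"}


def load_jsonl(path: str) -> List[dict]:
    """Read one JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def detection_rate(records: Iterable[dict], key: str = "security_risk") -> float:
    """Return the fraction of records whose risk label is HIGH or MEDIUM."""
    records = list(records)
    if not records:
        return 0.0  # avoid dividing by zero on an empty result set
    flagged = sum(
        1 for record in records
        if str(record.get(key, "")).upper() in FLAGGED_LEVELS
    )
    return flagged / len(records)


# Two instances, neither flagged as HIGH or MEDIUM.
sample = [{"security_risk": "LOW"}, {"security_risk": "UNKNOWN"}]
print(f"{detection_rate(sample):.2%}")  # prints 0.00%
```

Under these assumptions, a two-instance run in which neither instance is flagged yields 0.00%, which matches the shape of the result reported in the end-to-end test without saying anything about why the analyzer did not flag them.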