Add Apptainer SWE-bench evaluation#743
Conversation
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: Add Apptainer SWE-bench evaluation
🟡 Acceptable — Core functionality is sound, but there are blocking issues that must be fixed.
[CRITICAL ISSUES]
-
[benchmarks/swebench/apptainer_eval.py, Line 217] String Interpolation Bug: The f-string contains literal
{APPLY_PATCH_FAIL}instead of the substituted constant value. This means the grading constant won't be written to test output, causing all patch failures to be misreported as timeout/success.Same issue at line 227 with
{TESTS_TIMEOUT}.The shell script will literally echo the string
"{APPLY_PATCH_FAIL}"instead of the actual constant value (likely something like"PATCH_FAILED"or similar). The SWE-bench grading logic depends on these constants being present in the test output to determine the result.Fix: Change
echo "{APPLY_PATCH_FAIL}"toecho "{APPLY_PATCH_FAIL}"— but this requires checking how the f-string interpolation works with the nested curly braces. You may need{{escaping, or better yet, format the string differently to ensure the constants are properly substituted. -
[benchmarks/swebench/eval_infer.py, Line 326] Duplicate Variable Definition:
dest_report_pathis defined twice:- Line 326:
dest_report_path = input_file.with_suffix(".report.json") - Line 363:
dest_report_path = input_file.with_suffix(".report.json")
This creates confusion and could lead to bugs if code is modified. Move the definition to the top of the
if not args.skip_evaluationblock to avoid duplication. - Line 326:
[IMPROVEMENT OPPORTUNITIES]
-
[benchmarks/swebench/eval_infer.py, Lines 325-366] Nesting Complexity: The conditional block has 3 levels of nesting (
if not args.skip_evaluation:→if args.apptainer:→else:). This is at the edge of acceptable. Consider extracting the evaluation logic into a helper function to flatten the structure. -
[benchmarks/swebench/apptainer_eval.py, Lines 199-234] Large Embedded Shell Script: The shell script is 35+ lines embedded in a Python string. While sometimes necessary, this makes testing and debugging harder. Consider:
- Writing the script to a temporary file and executing it
- Extracting the script into a separate file resource
- Adding more granular error handling for shell-specific failures
-
[benchmarks/swebench/apptainer_eval.py, Line 238] Hardcoded Timeout Buffer:
timeout=timeout_seconds + 300adds a 5-minute buffer for Apptainer overhead. This magic number should be a named constant with a comment explaining why it's needed.
[TESTING GAPS]
-
No test for Apptainer binary not found: There's no test covering the case where
apptainercommand is not available. This could cause cryptic errors in production. -
No test for image pull failure: The
ensure_sandboxfunction handles build failures, but there's no test for network failures when pulling images.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
The string interpolation bug is a critical correctness issue that will cause incorrect grading results. All patches that fail to apply will not be properly marked as failures, leading to inflated pass rates. The other issues are quality concerns but not blockers.
Recommendation: Do not merge until the string interpolation bug is fixed. Verify the constants are being substituted correctly by adding a test or manual verification.
VERDICT:
❌ Needs rework: The string interpolation bug must be fixed before merging.
KEY INSIGHT:
The f-string literal {APPLY_PATCH_FAIL} is not being substituted — the constants need to be properly interpolated into the shell script for grading to work correctly.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
@OpenHands /iterate |
|
I'm on it! neubig can track my progress at all-hands.dev |
SummaryI addressed the review comments on PR #743 as follows: 1. Critical — 2. Critical — 3. Important — duplicate 4. Suggestion — magic number Status:
|
Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed review feedback in d932fde and resolved the review threads. CI is now green; ready for another bot review. This comment was created by an AI agent (OpenHands) on behalf of neubig. |
Summary
Adds a local Apptainer evaluation path to
swebench-eval:--apptainerflag for scoring SWE-bench predictions without Modal or Dockerresolved_idsfor downstream Laminar score updatesMotivation
Some HPC environments have Apptainer available but do not have a Docker daemon/socket and may not have Modal auth configured. In those environments inference can run with Apptainer, but scoring previously required falling back to an ad hoc script.
Verification
python -m py_compile benchmarks/swebench/apptainer_eval.py benchmarks/swebench/eval_infer.py tests/test_swebench_eval_infer.pyuv run ruff check benchmarks/swebench/apptainer_eval.py benchmarks/swebench/eval_infer.py tests/test_swebench_eval_infer.pyuv run ruff format --check benchmarks/swebench/apptainer_eval.py benchmarks/swebench/eval_infer.py tests/test_swebench_eval_infer.py/home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest -q tests/test_swebench_eval_infer.pyIssue
Closes #747
This PR description update was created by an AI agent (Codex) on behalf of Graham Neubig.