Add Apptainer SWE-bench evaluation by neubig · Pull Request #743 · OpenHands/benchmarks

neubig · 2026-06-09T21:17:49Z

Summary

Adds a local Apptainer evaluation path to swebench-eval:

- new --apptainer flag for scoring SWE-bench predictions without Modal or Docker
- reusable writable Apptainer sandboxes for official SWE-bench instance images
- per-instance scoring artifacts and a report JSON containing resolved_ids for downstream Laminar score updates
- README documentation for local Apptainer scoring
- focused tests for empty-patch handling and report generation
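
As an illustration of the report artifact described above, here is a minimal sketch of writing a report JSON with a resolved_ids list. It assumes the report sits next to the predictions file using the with_suffix(".report.json") pattern quoted in the review of eval_infer.py. The write_report helper, the extra count fields, and the instance IDs are hypothetical and are not taken from the PR.

```python
import json
import tempfile
from pathlib import Path


def write_report(input_file: Path, results: dict[str, bool]) -> Path:
    """Write a report JSON next to input_file and return its path."""
    resolved_ids = sorted(iid for iid, ok in results.items() if ok)
    report = {
        "resolved_ids": resolved_ids,
        # Hypothetical extra fields, shown only for illustration.
        "total_instances": len(results),
        "resolved_instances": len(resolved_ids),
    }
    # "output.jsonl" becomes "output.report.json".
    report_path = input_file.with_suffix(".report.json")
    report_path.write_text(json.dumps(report, indent=2))
    return report_path


with tempfile.TemporaryDirectory() as tmp:
    predictions = Path(tmp) / "output.jsonl"
    path = write_report(predictions, {"repo__a-1": True, "repo__b-2": False})
    report_name = path.name
    loaded = json.loads(path.read_text())
```

A downstream consumer would then only need to read resolved_ids from the file, regardless of which backend produced it.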

Motivation

Some HPC environments have Apptainer available but do not have a Docker daemon/socket and may not have Modal auth configured. In those environments inference can run with Apptainer, but scoring previously required falling back to an ad hoc script.

Verification

python -m py_compile benchmarks/swebench/apptainer_eval.py benchmarks/swebench/eval_infer.py tests/test_swebench_eval_infer.py
uv run ruff check benchmarks/swebench/apptainer_eval.py benchmarks/swebench/eval_infer.py tests/test_swebench_eval_infer.py
uv run ruff format --check benchmarks/swebench/apptainer_eval.py benchmarks/swebench/eval_infer.py tests/test_swebench_eval_infer.py
/home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest -q tests/test_swebench_eval_infer.py

Issue

Closes #747

This PR description update was created by an AI agent (Codex) on behalf of Graham Neubig.

all-hands-bot · 2026-06-09T21:23:40Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review: Add Apptainer SWE-bench evaluation

🟡 Acceptable — Core functionality is sound, but there are blocking issues that must be fixed.

[CRITICAL ISSUES]

[benchmarks/swebench/apptainer_eval.py, Line 217] String Interpolation Bug: The f-string contains literal {APPLY_PATCH_FAIL} instead of the substituted constant value. This means the grading constant won't be written to test output, causing all patch failures to be misreported as timeout/success.

Same issue at line 227 with {TESTS_TIMEOUT}.

The shell script will literally echo the string "{APPLY_PATCH_FAIL}" instead of the actual constant value (likely something like "PATCH_FAILED" or similar). The SWE-bench grading logic depends on these constants being present in the test output to determine the result.

Fix: Change echo "{APPLY_PATCH_FAIL}" to echo "{APPLY_PATCH_FAIL}" — but this requires checking how the f-string interpolation works with the nested curly braces. You may need {{ escaping, or better yet, format the string differently to ensure the constants are properly substituted.
[benchmarks/swebench/eval_infer.py, Line 326] Duplicate Variable Definition: dest_report_path is defined twice:
- Line 326: dest_report_path = input_file.with_suffix(".report.json")
- Line 363: dest_report_path = input_file.with_suffix(".report.json")
This creates confusion and could lead to bugs if the code is later modified. Move the definition to the top of the if not args.skip_evaluation block to avoid duplication.
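
For reference, the f-string brace rules at stake can be checked in isolation. This is a minimal sketch, not the code from apptainer_eval.py: the constant values are placeholders (the real ones come from the SWE-bench harness and are not shown in this thread), and build_script is a hypothetical helper.

```python
# Placeholder values for illustration only; the actual grading constants
# are defined by the SWE-bench harness.
APPLY_PATCH_FAIL = "PATCH_FAIL_MARKER"
TESTS_TIMEOUT = "TIMEOUT_MARKER"


def build_script(patch_path: str) -> str:
    """Build a tiny shell snippet with an f-string (hypothetical helper)."""
    # Inside an f-string, {{ and }} emit literal braces (here: a shell
    # command group), while {NAME} is replaced by the value of NAME.
    return (
        f'git apply "{patch_path}" || '
        f'{{ echo "{APPLY_PATCH_FAIL}"; exit 1; }}\n'
    )


script = build_script("/tmp/model.patch")
print(script)

# Single braces interpolate, doubled braces stay literal, and triple
# braces wrap the interpolated value in literal braces.
single = f"{APPLY_PATCH_FAIL}"
doubled = f"{{APPLY_PATCH_FAIL}}"
tripled = f"{{{APPLY_PATCH_FAIL}}}"
```

Under these rules, a single-braced name next to a doubled shell brace is still interpolated, so whether a given echo line emits the constant depends on the exact brace count in the source, which is worth confirming with a small test like the one above.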

[IMPROVEMENT OPPORTUNITIES]

[benchmarks/swebench/eval_infer.py, Lines 325-366] Nesting Complexity: The conditional block has 3 levels of nesting (if not args.skip_evaluation: → if args.apptainer: → else:). This is at the edge of acceptable. Consider extracting the evaluation logic into a helper function to flatten the structure.
[benchmarks/swebench/apptainer_eval.py, Lines 199-234] Large Embedded Shell Script: The shell script is 35+ lines embedded in a Python string. While sometimes necessary, this makes testing and debugging harder. Consider:
- Writing the script to a temporary file and executing it
- Extracting the script into a separate file resource
- Adding more granular error handling for shell-specific failures
[benchmarks/swebench/apptainer_eval.py, Line 238] Hardcoded Timeout Buffer: timeout=timeout_seconds + 300 adds a 5-minute buffer for Apptainer overhead. This magic number should be a named constant with a comment explaining why it's needed.
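
Two of the suggestions above (a temporary script file and a named timeout buffer) can be sketched together. This is an illustrative sketch under stated assumptions: run_script and TIMEOUT_BUFFER_SECONDS are hypothetical names, and the script runs directly with sh on the host rather than inside an Apptainer sandbox as the PR does.

```python
import os
import subprocess
import tempfile

# Extra seconds on top of the test timeout; an assumed allowance for
# container start-up overhead, not a measured value.
TIMEOUT_BUFFER_SECONDS = 300


def run_script(script: str, timeout_seconds: int) -> subprocess.CompletedProcess:
    """Write a shell script to a temporary file and run it with sh."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
        handle.write(script)
        script_path = handle.name
    try:
        return subprocess.run(
            ["sh", script_path],
            capture_output=True,
            text=True,
            timeout=timeout_seconds + TIMEOUT_BUFFER_SECONDS,
        )
    finally:
        # Remove the temporary script whether or not the run succeeded.
        os.unlink(script_path)


result = run_script('echo "hello from script"\n', timeout_seconds=10)
```

Keeping the script in a file also makes it possible to inspect or rerun the exact text that was executed when debugging a failed instance.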

[TESTING GAPS]

No test for Apptainer binary not found: There's no test covering the case where the apptainer command is not available. This could cause cryptic errors in production.
No test for image pull failure: The ensure_sandbox function handles build failures, but there's no test for network failures when pulling images.
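
The first gap could be covered along these lines. This is a minimal sketch, assuming a hypothetical require_apptainer helper (the PR's actual entry point is not shown in this thread) and using unittest.mock to simulate a host where the binary is absent.

```python
import shutil
from unittest import mock


def require_apptainer() -> str:
    """Return the path to the apptainer binary or raise a clear error."""
    path = shutil.which("apptainer")
    if path is None:
        raise RuntimeError(
            "apptainer was not found on PATH; install it or drop --apptainer"
        )
    return path


def test_missing_apptainer_raises_clear_error() -> None:
    # Simulate a host without the binary by making the lookup fail.
    with mock.patch("shutil.which", return_value=None):
        try:
            require_apptainer()
        except RuntimeError as exc:
            assert "apptainer" in str(exc)
        else:
            raise AssertionError("expected RuntimeError")


test_missing_apptainer_raises_clear_error()
```

The same patching approach could in principle simulate a failed image pull by making the subprocess call raise, though that depends on how ensure_sandbox invokes Apptainer.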

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

The string interpolation bug is a critical correctness issue that will cause incorrect grading results. All patches that fail to apply will not be properly marked as failures, leading to inflated pass rates. The other issues are quality concerns but not blockers.

Recommendation: Do not merge until the string interpolation bug is fixed. Verify the constants are being substituted correctly by adding a test or manual verification.

VERDICT:
❌ Needs rework: The string interpolation bug must be fixed before merging.

KEY INSIGHT:
The f-string literal {APPLY_PATCH_FAIL} is not being substituted — the constants need to be properly interpolated into the shell script for grading to work correctly.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

neubig · 2026-06-09T21:29:36Z

@OpenHands /iterate

openhands-ai · 2026-06-09T21:30:00Z

I'm on it! neubig can track my progress at all-hands.dev

openhands-ai · 2026-06-09T21:34:25Z

Summary

I addressed the review comments on PR #743 as follows:

1. Critical — {APPLY_PATCH_FAIL} literal string (line 217):
Fixed. Changed {{ echo "{APPLY_PATCH_FAIL}" to echo "{{{APPLY_PATCH_FAIL}}}". The {{ was producing a literal { in the shell script, preventing the Python f-string variable from being substituted.

2. Critical — {TESTS_TIMEOUT} literal string (line 227):
Fixed. Same approach — changed to echo "{{{TESTS_TIMEOUT}}}".

3. Important — duplicate dest_report_path (eval_infer.py line 326):
Not fixed. The variable is used both inside the if not args.skip_evaluation block and referenced at the end of the function for a None-check (if not args.skip_evaluation and dest_report_path is not None), so the initial = None declaration is necessary. This is not a bug.

4. Suggestion — magic number timeout_seconds + 300 (line 234):
Fixed. Added a named constant _APPTAINER_TIMEOUT_BUFFER = 300 with a comment explaining it accounts for Apptainer overhead (image extraction, sandbox init).

Status:

✅ python -m py_compile passes
✅ ruff check passes
✅ ruff format --check passes
✅ Changes staged, commit not yet pushed (was interrupted)

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-06-11T00:30:48Z

Addressed review feedback in d932fde and resolved the review threads. CI is now green; ready for another bot review.

This comment was created by an AI agent (OpenHands) on behalf of neubig.

Add Apptainer SWE-bench evaluation

06a8e87

neubig added the review-this label Jun 9, 2026

all-hands-bot requested changes Jun 9, 2026


Comment thread benchmarks/swebench/apptainer_eval.py

Comment thread benchmarks/swebench/apptainer_eval.py

Comment thread benchmarks/swebench/eval_infer.py

Comment thread benchmarks/swebench/apptainer_eval.py

neubig mentioned this pull request Jun 10, 2026

Track PR #743: Add Apptainer SWE-bench evaluation #747

Open

fix: satisfy apptainer evaluation checks

d932fde

Co-authored-by: openhands <openhands@all-hands.dev>

neubig requested a review from all-hands-bot June 11, 2026 00:30


Add Apptainer SWE-bench evaluation #743
neubig wants to merge 2 commits into main from add-swebench-apptainer-eval

3 participants
