Add Apptainer SWE-bench evaluation #743

Open
neubig wants to merge 2 commits into main from add-swebench-apptainer-eval

Conversation

neubig commented Jun 9, 2026

Member

Summary

Adds a local Apptainer evaluation path to swebench-eval:

  • new --apptainer flag for scoring SWE-bench predictions without Modal or Docker
  • reusable writable Apptainer sandboxes for official SWE-bench instance images
  • per-instance scoring artifacts and a report JSON containing resolved_ids for downstream Laminar score updates
  • README documentation for local Apptainer scoring
  • focused tests for empty-patch handling and report generation
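
As a rough illustration of how a downstream step might consume the report, the sketch below loads a report file and collects its resolved_ids. The helper name load_resolved_ids and the assumption that resolved_ids is a top-level list of instance-ID strings are illustrative and not confirmed by this PR.

```python
import json
from pathlib import Path


def load_resolved_ids(report_path: Path) -> set[str]:
    """Return the instance IDs listed under "resolved_ids" in a report JSON.

    Assumes "resolved_ids" is a top-level list of strings (an assumption, not
    confirmed by the PR); a missing key is treated as "nothing resolved".
    """
    with report_path.open(encoding="utf-8") as fh:
        report = json.load(fh)
    return set(report.get("resolved_ids", []))
```

A score-update step could then check membership per instance, for example `instance_id in load_resolved_ids(path)`.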

Motivation

Some HPC environments have Apptainer available but do not have a Docker daemon/socket and may not have Modal auth configured. In those environments inference can run with Apptainer, but scoring previously required falling back to an ad hoc script.

Verification

  • python -m py_compile benchmarks/swebench/apptainer_eval.py benchmarks/swebench/eval_infer.py tests/test_swebench_eval_infer.py
  • uv run ruff check benchmarks/swebench/apptainer_eval.py benchmarks/swebench/eval_infer.py tests/test_swebench_eval_infer.py
  • uv run ruff format --check benchmarks/swebench/apptainer_eval.py benchmarks/swebench/eval_infer.py tests/test_swebench_eval_infer.py
  • /home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest -q tests/test_swebench_eval_infer.py

Issue

Closes #747

This PR description update was created by an AI agent (Codex) on behalf of Graham Neubig.

all-hands-bot commented Jun 9, 2026

Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot left a comment

Collaborator

Code Review: Add Apptainer SWE-bench evaluation

🟡 Acceptable — Core functionality is sound, but there are blocking issues that must be fixed.


[CRITICAL ISSUES]

  • [benchmarks/swebench/apptainer_eval.py, Line 217] String Interpolation Bug: The f-string contains literal {APPLY_PATCH_FAIL} instead of the substituted constant value. This means the grading constant won't be written to test output, causing all patch failures to be misreported as timeout/success.

    Same issue at line 227 with {TESTS_TIMEOUT}.

    The shell script will literally echo the string "{APPLY_PATCH_FAIL}" instead of the actual constant value (likely something like "PATCH_FAILED" or similar). The SWE-bench grading logic depends on these constants being present in the test output to determine the result.

    Fix: Make sure the echo lines interpolate the constants instead of emitting their names. In an f-string, {APPLY_PATCH_FAIL} substitutes the value, while {{ and }} produce literal braces, so check how the interpolation interacts with any escaped braces the shell script itself needs. If that gets hard to read, format the string differently so the constants are reliably substituted.

  • [benchmarks/swebench/eval_infer.py, Line 326] Duplicate Variable Definition: dest_report_path is defined twice:

    • Line 326: dest_report_path = input_file.with_suffix(".report.json")
    • Line 363: dest_report_path = input_file.with_suffix(".report.json")

    This creates confusion and could lead to bugs if code is modified. Move the definition to the top of the if not args.skip_evaluation block to avoid duplication.
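
To make the brace behaviour behind the first issue concrete, here is a minimal sketch of how Python f-strings treat single and doubled braces. The constant value is a stand-in, and the shell fragments are illustrative rather than the actual lines from apptainer_eval.py.

```python
# Stand-in value; the real constant comes from the SWE-bench harness.
APPLY_PATCH_FAIL = "PATCH_APPLY_FAILED_MARKER"

# Doubled braces are an escape: the name is emitted literally.
escaped = f'echo "{{APPLY_PATCH_FAIL}}"'

# Single braces interpolate the Python value.
interpolated = f'echo "{APPLY_PATCH_FAIL}"'

# Doubled braces are still needed for braces the shell itself must see,
# such as a { ...; } command group.
grouped = f'git apply patch.diff || {{ echo "{APPLY_PATCH_FAIL}"; exit 1; }}'

print(escaped)
print(interpolated)
print(grouped)
```

Only the second and third forms put the marker value into the generated script. The first writes the placeholder name itself, which is the failure mode the review describes.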


[IMPROVEMENT OPPORTUNITIES]

  • [benchmarks/swebench/eval_infer.py, Lines 325-366] Nesting Complexity: The conditional block has 3 levels of nesting (if not args.skip_evaluation: → if args.apptainer: → else:). This is at the edge of acceptable. Consider extracting the evaluation logic into a helper function to flatten the structure.

  • [benchmarks/swebench/apptainer_eval.py, Lines 199-234] Large Embedded Shell Script: The shell script is 35+ lines embedded in a Python string. While sometimes necessary, this makes testing and debugging harder. Consider:

    • Writing the script to a temporary file and executing it
    • Extracting the script into a separate file resource
    • Adding more granular error handling for shell-specific failures
  • [benchmarks/swebench/apptainer_eval.py, Line 238] Hardcoded Timeout Buffer: timeout=timeout_seconds + 300 adds a 5-minute buffer for Apptainer overhead. This magic number should be a named constant with a comment explaining why it's needed.
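
One way the temporary-file and named-constant suggestions could fit together is sketched below. The names run_script and TIMEOUT_BUFFER_SECONDS are hypothetical, and the sketch runs the script with sh on the host for simplicity, whereas the PR executes it inside an Apptainer sandbox.

```python
import os
import subprocess
import tempfile

# Extra wall-clock allowance on top of the test timeout (illustrative value).
TIMEOUT_BUFFER_SECONDS = 300


def run_script(script: str, timeout_seconds: int) -> subprocess.CompletedProcess:
    """Write a shell script to a temporary file, run it with sh, and clean up."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as fh:
        fh.write(script)
        path = fh.name
    try:
        return subprocess.run(
            ["sh", path],
            capture_output=True,
            text=True,
            timeout=timeout_seconds + TIMEOUT_BUFFER_SECONDS,
        )
    finally:
        # Remove the temporary script even if the run times out or fails.
        os.unlink(path)
```

Keeping the script in a file makes it possible to inspect or run it by hand when debugging, at the cost of one extra cleanup step.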


[TESTING GAPS]

  • No test for Apptainer binary not found: There's no test covering the case where apptainer command is not available. This could cause cryptic errors in production.

  • No test for image pull failure: The ensure_sandbox function handles build failures, but there's no test for network failures when pulling images.
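
A test for the missing-binary case might look roughly like the sketch below. The helper require_apptainer is hypothetical and is not a function from this PR; it uses shutil.which to look up the binary, and the test simulates a host without Apptainer by patching that lookup.

```python
import shutil
from unittest import mock


def require_apptainer() -> str:
    """Return the path to the apptainer binary or raise a clear error."""
    path = shutil.which("apptainer")
    if path is None:
        raise RuntimeError("apptainer not found on PATH; install it or drop --apptainer")
    return path


def test_require_apptainer_missing() -> None:
    # Simulate a host without Apptainer by making the PATH lookup fail.
    with mock.patch("shutil.which", return_value=None):
        try:
            require_apptainer()
        except RuntimeError as exc:
            assert "apptainer not found" in str(exc)
        else:
            raise AssertionError("expected RuntimeError")
```

A similar pattern, patching the call that builds or pulls the image so that it raises, could cover the image pull failure case.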


[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

The string interpolation bug is a critical correctness issue that will cause incorrect grading results. Patches that fail to apply will not be marked as failures, which inflates pass rates. The other issues are quality concerns but not blockers.

Recommendation: Do not merge until the string interpolation bug is fixed. Verify the constants are being substituted correctly by adding a test or manual verification.


VERDICT:
Needs rework: The string interpolation bug must be fixed before merging.

KEY INSIGHT:
The f-string literal {APPLY_PATCH_FAIL} is not being substituted — the constants need to be properly interpolated into the shell script for grading to work correctly.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

neubig commented Jun 9, 2026

Member Author

@OpenHands /iterate

openhands-ai Bot commented Jun 9, 2026


I'm on it! neubig can track my progress at all-hands.dev

openhands-ai Bot commented Jun 9, 2026


Summary

I addressed the review comments on PR #743 as follows:

1. Critical — {APPLY_PATCH_FAIL} literal string (line 217):
Fixed. Changed {{ echo "{APPLY_PATCH_FAIL}" to echo "{{{APPLY_PATCH_FAIL}}}". The {{ was producing a literal { in the shell script, preventing the Python f-string variable from being substituted.

2. Critical — {TESTS_TIMEOUT} literal string (line 227):
Fixed. Same approach — changed to echo "{{{TESTS_TIMEOUT}}}".

3. Important — duplicate dest_report_path (eval_infer.py line 326):
Not fixed. The variable is used both inside the if not args.skip_evaluation block and referenced at the end of the function for a None-check (if not args.skip_evaluation and dest_report_path is not None), so the initial = None declaration is necessary. This is not a bug.

4. Suggestion — magic number timeout_seconds + 300 (line 234):
Fixed. Added a named constant _APPTAINER_TIMEOUT_BUFFER = 300 with a comment explaining it accounts for Apptainer overhead (image extraction, sandbox init).
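
For reference, the triple-brace form mentioned in items 1 and 2 renders as sketched below. The constant value is a stand-in, and whether grading tolerates the surrounding braces depends on how the harness matches the marker (substring versus exact line), which is not confirmed here.

```python
# Stand-in value; the real constant comes from the SWE-bench harness.
TESTS_TIMEOUT = "TESTS_TIMED_OUT_MARKER"

# {{ -> literal "{", {TESTS_TIMEOUT} -> the value, }} -> literal "}".
line = f'echo "{{{TESTS_TIMEOUT}}}"'
print(line)

# The marker is present as a substring, though wrapped in literal braces.
assert TESTS_TIMEOUT in line
assert line == 'echo "{TESTS_TIMED_OUT_MARKER}"'
```

If the braces are unwanted in the test output, the single-brace form `{TESTS_TIMEOUT}` interpolates the value without them.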

Status:

  • python -m py_compile passes
  • ruff check passes
  • ruff format --check passes
  • ✅ Changes staged, commit not yet pushed (was interrupted)

Co-authored-by: openhands <openhands@all-hands.dev>

neubig commented Jun 11, 2026

Member Author

Addressed review feedback in d932fde and resolved the review threads. CI is now green; ready for another bot review.

This comment was created by an AI agent (OpenHands) on behalf of neubig.

neubig requested a review from all-hands-bot June 11, 2026 00:30
Development

Successfully merging this pull request may close these issues.

Track PR #743: Add Apptainer SWE-bench evaluation