Benchmark for evaluating LLMs on corner case generation, code judgment, and debugging. This dataset was generated using GPT, Gemini, and Claude and should not be used to develop competing models.
📄 Paper: https://arxiv.org/abs/2603.15921
🤗 Dataset: https://huggingface.co/datasets/Salesforce/vibepass
# Install dependencies
pip install -r requirements.txt
# Set environment variables
cp .env.example .env
# Edit .env with your API keys
# Run evaluation
python src/eval.py \
--input data/benchmark.jsonl \
--output outputs/results.jsonl \
--model sonnet4.5 \
--task corner_case
- OpenAI: gpt-5-* (add _low, _medium, _high, _minimal for effort)
- Anthropic: opus4.6, sonnet4.6, opus4.5, sonnet4.5, haiku4.5 (add _thinking)
- Gemini: gemini-2.0-flash-exp, gemini-1.5-pro
- Together AI: Various open-source models
- corner_case: Generate test cases that expose bugs in implementations
- judge: Evaluate whether a solution is correct or buggy
- debug: Fix buggy implementations
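To make the corner_case task concrete, here is a toy illustration of what it means for a test input to expose a bug. It is only a sketch: `reference_max`, `buggy_max`, and `exposes_bug` are hypothetical names invented for this example, and the benchmark's own harness in src/eval.py may check this differently.

```python
def reference_max(nums):
    """Correct implementation: maximum of a non-empty list."""
    return max(nums)


def buggy_max(nums):
    """Buggy implementation: starts from 0, so all-negative inputs fail."""
    best = 0
    for n in nums:
        if n > best:
            best = n
    return best


def exposes_bug(test_input):
    """A test input exposes the bug when the two implementations disagree."""
    return reference_max(test_input) != buggy_max(test_input)


print(exposes_bug([3, 1, 2]))     # False: a typical input hides the bug
print(exposes_bug([-5, -2, -9]))  # True: the all-negative corner case reveals it
```

A typical input leaves the bug hidden, while the all-negative list makes the two implementations disagree, which is the kind of input the corner_case task asks a model to produce.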
--input FILE # Input JSONL
--output FILE # Output JSONL
--model MODEL # Model name
--task TASK # corner_case, judge, or debug
--lcb_data PATH # LCB data (default: curation/data/lcb/test*.jsonl)
--timeout SECONDS # Default: 60
--num_process_generate # Default: 16
--num_process_evaluate # Default: 4
OPENAI_API_KEY=...
OPENAI_BASE_URL=... # Optional
X_API_KEY=... # Optional gateway key
TOGETHER_API_KEY=...
GOOGLE_CLOUD_PROJECT=...
GOOGLE_CLOUD_LOCATION=global
SANDBOX_HOST=localhost
SANDBOX_PORT=8080
{
"coca_id": "id",
"question_id": "platform_id",
"question_content": "Problem...",
"platform": "leetcode",
"buggy_model_solution": "def solution(): ...",
"test_checker": "def is_valid_test(): ...",
"starter_code": "class Solution: ..."
}
The sandbox expects a POST to http://localhost:8080/run_code:
{"code": "print('hello')", "language": "python", "run_timeout": 10}
Returns:
{"status": "Success", "run_result": {"status": "Finished", "stdout": "hello\n"}}
.
├── .env.example # Configuration template
├── .gitignore # Git ignore patterns
├── LICENSE # MIT License
├── README.md # This file
├── requirements.txt # Dependencies
└── src/
    ├── eval.py # Main evaluation script
    ├── llm_generator.py # LLM providers
    ├── utils.py # Utilities
    └── prompts/ # Prompt templates
        ├── corner_case.py
        ├── judge.py
        ├── debug.py
        └── codegen.py
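The sandbox request and response shown earlier can be handled with a pair of small helpers. This is a minimal sketch, not the code in src/utils.py: `build_run_request` and `extract_stdout` are hypothetical names, and treating any status other than "Success" / "Finished" as a failure is an assumption, since the README shows only the successful case. No network call is made here; the payload would be sent as the JSON body of the POST to /run_code.

```python
import json


def build_run_request(code, language="python", run_timeout=10):
    """Build the JSON payload for the /run_code endpoint (fields from the README)."""
    return {"code": code, "language": language, "run_timeout": run_timeout}


def extract_stdout(response):
    """Return stdout when the run finished successfully, otherwise None.

    Assumption: any status other than "Success" / "Finished" means failure.
    """
    if response.get("status") != "Success":
        return None
    run_result = response.get("run_result") or {}
    if run_result.get("status") != "Finished":
        return None
    return run_result.get("stdout")


# Request body as it would be sent to the sandbox.
request_body = json.dumps(build_run_request("print('hello')"))

# Sample response copied from the README.
sample_response = {
    "status": "Success",
    "run_result": {"status": "Finished", "stdout": "hello\n"},
}

print(request_body)
print(repr(extract_stdout(sample_response)))
```

Returning None instead of raising keeps the caller free to decide how to score a failed or timed-out run.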
@misc{bansal2026vibepassvibecodersreally,
title={VIBEPASS: Can Vibe Coders Really Pass the Vibe Check?},
author={Srijan Bansal and Jiao Fangkai and Yilun Zhou and Austin Xu and Shafiq Joty and Semih Yavuz},
year={2026},
eprint={2603.15921},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2603.15921}
}