This repository contains the code for our work on adaptive modality selection for spatial reasoning. The framework evaluates when a model should answer directly using language-based reasoning and when it should switch to a symbolic grid-based representation.
.
├── common/
│ ├── llm.py # vLLM and OpenAI-compatible clients
│ ├── io_utils.py # input/output utilities
│ ├── relations.py # relation extraction utilities
│ ├── metrics.py # evaluation and switching-policy analysis
│ └── switching/ # switching metrics and routing logic
│ ├── config.py # switching configuration
│ ├── complexity.py # complexity estimation
│ ├── trust.py # trustworthiness estimation
│ ├── shortcircuit.py # short-circuit routing for efficiency
│ └── thresholds.py # threshold selection
│
├── stepgame/
│ ├── switching/ # StepGame switching experiments
│ ├── grid_experiments/ # relation extraction and grid construction
│ └── shared/
│
├── spartun/
│ ├── switching/ # SpaRTUN switching experiments
│ └── grid_experiments/ # grid construction, relation extraction, and QA runners
│
├── resq/
│ └── grid_experiments/ # ReSQ grid-based reasoning pipeline
│
├── requirements.txt
└── README.md
Each dataset folder contains its own README.md with dataset-specific commands, flags, and experiment details.
The datasets are not redistributed in this repository. Please download them from the original releases and provide the corresponding paths to the runners using --data, --input, or the environment variables described below.
| Dataset | Source |
|---|---|
| StepGame | correct_clean split from https://github.com/Fangjun-Li/SpatialLM-StepGame |
| SpaRTUN & ReSQ | https://github.com/HLR/SpaRTUN |
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The experiments use the following environment variables when applicable:
| Variable | Description |
|---|---|
| OPENAI_API_KEY | API key for OpenAI model calls |
| VLLM_BASE_URL | Local OpenAI-compatible vLLM endpoint |
| VLLM_MODEL | Served vLLM model identifier |
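
As a quick pre-flight check, the variables in the table above can be inspected before launching a run. The sketch below is only illustrative: `load_llm_settings` is a hypothetical helper, not part of this repository, and the actual clients in common/llm.py may read or default these variables differently. The endpoint URL and model name in the demo are example values.

```python
import os

# Variable names taken from the table above.
LLM_ENV_VARS = ("OPENAI_API_KEY", "VLLM_BASE_URL", "VLLM_MODEL")


def load_llm_settings(env=None):
    """Return the LLM-related variables that are set and the ones that are missing.

    Hypothetical helper for illustration; pass a dict to test without
    touching the real environment.
    """
    env = os.environ if env is None else env
    values = {name: env[name] for name in LLM_ENV_VARS if env.get(name)}
    missing = [name for name in LLM_ENV_VARS if name not in values]
    return {"values": values, "missing": missing}


if __name__ == "__main__":
    # Example values only; replace with your own endpoint and model.
    demo_env = {
        "VLLM_BASE_URL": "http://localhost:8000/v1",
        "VLLM_MODEL": "my-served-model",
    }
    settings = load_llm_settings(demo_env)
    print("set:", sorted(settings["values"]))
    print("missing:", settings["missing"])
```

Which variables are actually required depends on the experiment: runs that only call a local vLLM server may not need OPENAI_API_KEY, and vice versa.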
First tune the switching thresholds on the validation split. This saves thresholds.json in the specified output directory.
python -m stepgame.switching.run_switching \
--model qwen8b \
--split val \
--data /path/to/stepgame_reports.jsonl \
--out-dir runs/qwen8b

Then evaluate on the test split using the validation thresholds. The test run uses the short-circuit cascade for efficiency.
python -m stepgame.switching.run_switching \
--model qwen8b \
--split test \
--data /path/to/stepgame_reports.jsonl \
--out-dir runs/qwen8b

Grid experiments build the grid and run the reasoning modes (relation extraction → grid → text/relations/grid answers):
VLLM_MODEL=<served-model> PYTHONPATH=stepgame/shared \
python stepgame/grid_experiments/run_phase1.py samples.jsonl out.jsonl

SpaRTUN follows the same validation-then-test protocol.
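
The validation-then-test protocol can be pictured with a small toy example: pick thresholds on validation records, write them to thresholds.json, then reload them unchanged for evaluation. Everything below except the thresholds.json filename is an assumption made for illustration. The record fields, the two-threshold rule, the candidate grids, and the function names are hypothetical and do not describe the repository's actual switching policy. The `route` function also shows one possible short-circuit cascade, in which the trust score is computed only when the complexity check does not already decide the route.

```python
import itertools
import json
import tempfile
from pathlib import Path

# Toy validation records. All field names and values are hypothetical.
VAL = [
    {"complexity": 1, "trust": 0.9, "correct": {"text": True, "grid": True}},
    {"complexity": 5, "trust": 0.8, "correct": {"text": False, "grid": True}},
    {"complexity": 2, "trust": 0.2, "correct": {"text": False, "grid": True}},
    {"complexity": 1, "trust": 0.7, "correct": {"text": True, "grid": False}},
]


def lookup_trust(example):
    """Stand-in for a more expensive trust estimate."""
    return example["trust"]


def route(example, thresholds, trust_fn=lookup_trust):
    """Short-circuit cascade: check complexity first, call trust_fn only if needed."""
    if example["complexity"] >= thresholds["complexity"]:
        return "grid"  # decided without evaluating trust
    return "text" if trust_fn(example) >= thresholds["trust"] else "grid"


def accuracy(examples, thresholds, trust_fn=lookup_trust):
    """Fraction of examples answered correctly under the routed mode."""
    hits = sum(ex["correct"][route(ex, thresholds, trust_fn)] for ex in examples)
    return hits / len(examples)


def tune(examples, complexity_grid, trust_grid):
    """Grid search: return the first candidate pair with the highest accuracy."""
    candidates = [
        {"complexity": c, "trust": t}
        for c, t in itertools.product(complexity_grid, trust_grid)
    ]
    return max(candidates, key=lambda th: accuracy(examples, th))


def save_thresholds(thresholds, out_dir):
    path = Path(out_dir) / "thresholds.json"
    path.write_text(json.dumps(thresholds, indent=2))
    return path


def load_thresholds(out_dir):
    return json.loads((Path(out_dir) / "thresholds.json").read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        save_thresholds(tune(VAL, [3, 6], [0.5, 0.95]), out_dir)
        frozen = load_thresholds(out_dir)
        # A real run would score held-out test records here, not VAL.
        print(frozen, accuracy(VAL, frozen))
```

Keeping the thresholds fixed after validation means the test split is never used for tuning.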
python -m spartun.switching.runner \
--input /path/to/spartun_switch_input.json \
--split all \
--out-dir runs/spartun

Grid experiments first extract relations from the stories, then answer with the pruned grid:
# 1) extract relations
python spartun/grid_experiments/relation_extraction/extract_relations_pipeline.py \
--input stories.json --output relations.json --model <served-model>
# 2) grid QA
python spartun/grid_experiments/grid_qa_runners/run_pruned_grid.py \
--model <served-model> --input relations.json --output preds.jsonl

ReSQ runs the grid-based reasoning pipeline directly (no threshold tuning). Set the served model and run:
VLLM_MODEL=<served-model> python resq/grid_experiments/resq_md.py

Grids are built with GPT by default; set RESQ_GRID_BACKEND=vllm to build them on the served model instead.
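
The backend switch described above amounts to reading one environment variable with a default. The sketch below shows that pattern under stated assumptions: `resolve_grid_backend` is a hypothetical function, the string `"gpt"` is a placeholder name for the default backend, and the case-insensitive matching is a choice made here. Only `RESQ_GRID_BACKEND` and the value `vllm` come from this README; resq_md.py may handle the variable differently.

```python
import os

# "vllm" is documented above; "gpt" is a placeholder label for the default.
KNOWN_BACKENDS = {"gpt", "vllm"}


def resolve_grid_backend(env=None):
    """Return the grid-building backend, falling back to the GPT default.

    Hypothetical helper for illustration; pass a dict to test without
    touching the real environment.
    """
    env = os.environ if env is None else env
    backend = env.get("RESQ_GRID_BACKEND", "gpt").strip().lower()
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"Unsupported RESQ_GRID_BACKEND: {backend!r}")
    return backend


if __name__ == "__main__":
    print(resolve_grid_backend({}))
    print(resolve_grid_backend({"RESQ_GRID_BACKEND": "vllm"}))
```

Failing early on an unrecognized value avoids silently building grids with a backend other than the one intended.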
If you use this code, please cite our paper. Citation information will be added after the arXiv version is available.