CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

arXiv | Hugging Face dataset | Dataset on ModelScope | License: MIT

📖 English | 简体中文

If you like our project, please give us a star ⭐ on GitHub to stay up to date with the latest updates.


🔎 Overview

CiteVQA is a document visual question answering benchmark for faithful evidence attribution. Unlike conventional DocVQA datasets that only score the final answer, CiteVQA requires a model to answer a question with evidence grounded in the source document at the element level. The benchmark is designed to evaluate whether a system can not only answer correctly, but also cite the right supporting region in long, real-world PDFs.

The dataset contains 1,897 questions built from 711 PDFs across 7 macro-domains and 30 sub-domains, with an average of 40.6 pages per document. It covers both English and Chinese documents, and includes single-document as well as multi-document settings.

The evaluation covers three dataset types:

  • Single-Doc: Single-document question answering.
  • Multi (1-Gold): Multi-document QA with exactly one gold document.
  • Multi (N-Gold): Multi-document QA with multiple gold documents.
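
For orientation, the snippet below sketches what a single question item with element-level evidence might look like once the benchmark JSON is loaded. The field names used here (question, answer, evidence, doc, page, bbox, setting, language) are illustrative assumptions rather than the exact released schema; check data/validation/CiteVQA.json for the authoritative keys.

import json

# Load the benchmark file released under data/validation/.
with open("data/validation/CiteVQA.json", encoding="utf-8") as f:
    items = json.load(f)

# Hypothetical shape of one item: a question, its answer, and element-level
# gold evidence given as (document, page, bounding box) entries.
example = {
    "question": "What accuracy does the proposed method reach on the test set?",
    "answer": "42.3%",
    "evidence": [
        {"doc": "paper_001.pdf", "page": 5, "bbox": [72.0, 310.5, 520.0, 355.0]}
    ],
    "setting": "Single-Doc",   # or "Multi (1-Gold)" / "Multi (N-Gold)"
    "language": "en",          # en or zh
}
print(len(items), "items loaded")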

CiteVQA overview

Overview of CiteVQA. Left: a prediction is counted as correct only when the answer is correct and the cited evidence region is both relevant and spatially aligned with the gold evidence under Strictly Attributed Accuracy (SAA). Right top: dataset statistics show that CiteVQA emphasizes long, realistic PDFs. Right bottom: existing MLLMs exhibit a substantial gap between answer accuracy and evidence-grounded accuracy.

✨ Highlights

  • Joint answer-and-evidence evaluation: Evaluates both answer correctness and citation faithfulness.
  • Element-level evidence: Structured gold evidence features bounding boxes, page, and document indices.
  • Long-document setting: Focuses on multi-page PDFs with realistic lengths and complex layouts.
  • Cross-domain and bilingual: Spans 7 domains, 30 sub-domains, and two languages (en, zh).
  • Multi-document reasoning: Features cross-document questions that require evidence aggregation.
  • Three evaluation settings: Supports Single-Doc, Multi (1-Gold), and Multi (N-Gold).

⚙️ Setup

Install dependencies:

pip install -r requirements.txt

Optional CJK font configuration for PDF rendering:

apt install fonts-noto-cjk poppler-data

cat > /etc/fonts/conf.d/99-pdf-cjk.conf << 'EOF'
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <alias><family>STSong-Light</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>STSong</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>SimSun</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>FangSong</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>KaiTi</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>SimHei</family><prefer><family>Noto Sans CJK SC</family></prefer></alias>
  <alias><family>Microsoft YaHei</family><prefer><family>Noto Sans CJK SC</family></prefer></alias>
</fontconfig>
EOF

fc-cache -f

📦 Data

From the repository root, you can fetch the benchmark files from Hugging Face into data/, then download the source PDFs:

pip install -U "huggingface_hub[cli]"
hf download opendatalab/CiteVQA --repo-type dataset --local-dir .
python data/download/download_pdfs.py --workers 16 --out data/pdf --csv data/download/pdf_source.csv

From the repository root, you can also fetch the benchmark files from ModelScope into data/, then download the source PDFs:

pip install -U modelscope
modelscope download --dataset OpenDataLab/CiteVQA --local_dir .
python data/download/download_pdfs.py --workers 16 --out data/pdf --csv data/download/pdf_source.csv

The PDF downloader reads data/download/pdf_source.csv and saves all files to data/pdf/.
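
For readers who want a feel for what the downloader does, the sketch below mirrors its documented behaviour: read the CSV, download files concurrently, retry on failure, and skip files that already exist. It is a simplified illustration rather than the actual data/download/download_pdfs.py, and the CSV column names used here ("url", "filename") are assumptions.

import csv
import os
from concurrent.futures import ThreadPoolExecutor

import requests

CSV_PATH = "data/download/pdf_source.csv"   # documented input CSV
OUT_DIR = "data/pdf"                        # documented output directory
WORKERS, TIMEOUT, RETRIES = 16, 120, 3      # documented defaults

def fetch(row):
    url, name = row["url"], row["filename"]  # assumed column names
    path = os.path.join(OUT_DIR, name)
    if os.path.exists(path):                 # default behaviour: skip existing files
        return name, "skipped"
    for attempt in range(RETRIES):
        try:
            resp = requests.get(url, timeout=TIMEOUT)
            resp.raise_for_status()
            with open(path, "wb") as f:
                f.write(resp.content)
            return name, "ok"
        except requests.RequestException:
            if attempt == RETRIES - 1:
                return name, "failed"

os.makedirs(OUT_DIR, exist_ok=True)
with open(CSV_PATH, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for name, status in pool.map(fetch, rows):
        print(status, name)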

If you run into dataset or download issues, jump to the Contact section.

Download Arguments

Option      Default          Description
--csv       pdf_source.csv   CSV file containing PDF URLs
--out       pdf              Output directory
--workers   16               Concurrent download workers
--timeout   120              Timeout per file in seconds
--retries   3                Retry count
--no-skip   -                Re-download existing files

🚀 Inference and Evaluation

The run.sh script provides a demo for evaluating GPT-5.4. Edit the API settings in run.sh, then run:

bash run.sh

Reference workflow:

# API config
API_TYPE=openai
API_KEY=YOUR_API_KEY
BASE_URL=YOUR_BASE_URL

# Inference
python infer/run.py \
  --api ${API_TYPE} \
  --model MODEL_NAME \
  --base_url ${BASE_URL} \
  --api_key ${API_KEY} \
  --workers 4 \
  --out outputs/infer/MODEL_NAME.json

# Evaluation
python eval/run.py \
  --judge_api ${API_TYPE} \
  --judge_model JUDGE_MODEL_NAME \
  --judge_api_key ${API_KEY} \
  --base_url ${BASE_URL} \
  --input outputs/infer/MODEL_NAME.json \
  --out outputs/eval/MODEL_NAME.json \
  --workers 24

# Summary
python eval/summarize.py \
  --input outputs/eval/MODEL_NAME.json \
  --out_dir outputs/eval/MODEL_NAME

🧭 Inference Arguments

Option         Required   Description
--api          Yes        openai, genai, or anthropic
--model        Yes        Model name
--api_key      Yes        API key
--base_url     No         API base URL
--workers      No         Number of workers (default 4)
--out          No         Output JSON path
--benchmark    No         Benchmark path (default data/data_items.json)
--limit        No         Sample limit; 0 means all
--max_pdf_mb   No         Compress PDFs larger than this size in MB
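
Before launching a full run, it can be cheaper to smoke-test the pipeline on a handful of samples. The sketch below drives infer/run.py from Python using only the flags documented above; the model name, API key, and base URL are placeholders to replace with your own values.

import subprocess

MODEL = "MODEL_NAME"  # placeholder model name
cmd = [
    "python", "infer/run.py",
    "--api", "openai",
    "--model", MODEL,
    "--api_key", "YOUR_API_KEY",
    "--base_url", "YOUR_BASE_URL",
    "--workers", "2",
    "--limit", "5",    # only a few samples for a quick sanity check
    "--out", f"outputs/infer/{MODEL}_smoke.json",
]
subprocess.run(cmd, check=True)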

📏 Evaluation Arguments

Option            Required   Description
--input           Yes        Inference output JSON
--judge_api       No         Judge API type (default openai)
--judge_model     No         Judge model name (default gpt-4o)
--judge_api_key   Yes        Judge API key
--base_url        No         API base URL
--metrics         No         Metrics list (default recall,rel)
--workers         No         Number of workers
--out             No         Output JSON path
--limit           No         Sample limit

🗂️ Repository Structure

CiteVQA/
├── data/
│   ├── validation/
│   │   └── CiteVQA.json         # Benchmark QA pairs
│   ├── pdf/                     # Downloaded PDFs
│   └── download/
│       ├── pdf_source.csv       # PDF metadata & URLs
│       └── download_pdfs.py     # PDF download script
├── infer/
│   └── run.py                   # Inference script
├── eval/
│   ├── run.py                   # Evaluation script
│   └── summarize.py             # Summary table generator
├── prompts/                     # System & user prompts
├── outputs/                     # Inference & evaluation outputs
├── requirements.txt
└── run.sh                       # Demo script

📊 Evaluation Metrics

Metric                      Meaning
Recall                      Whether the predicted evidence overlaps the crucial ground-truth evidence
Relevance (Rel.)            Whether the cited evidence semantically supports the answer
Answer Correctness (Ans.)   Whether the answer is correct
SAA                         Strictly Attributed Accuracy: the answer and the cited evidence must both be valid
Page Recall                 Whether the correct page is identified
Precision / F1              Precision and overlap quality of the predicted evidence

SAA is the core metric of CiteVQA.
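
To make the metric concrete, the sketch below shows how a strictly attributed decision could be computed for one sample, assuming the answer judgment and evidence relevance are already available and that spatial alignment is an IoU test between predicted and gold bounding boxes. This is a conceptual illustration; eval/run.py may implement the matching details differently.

def iou(a, b):
    # Intersection-over-union of two boxes given as [x0, y0, x1, y1].
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def strictly_attributed(answer_correct, evidence_relevant, pred_boxes, gold_boxes, iou_thresh=0.5):
    # A sample counts toward SAA only if the answer is correct AND the cited
    # evidence is relevant AND every gold box is matched by some predicted box.
    # The 0.5 IoU threshold is an assumed value, not taken from the paper.
    aligned = all(any(iou(p, g) >= iou_thresh for p in pred_boxes) for g in gold_boxes)
    return answer_correct and evidence_relevant and aligned

# Example: correct answer, relevant citation, and an overlapping box count toward SAA.
print(strictly_attributed(True, True, [[70, 300, 525, 360]], [[72, 310, 520, 355]]))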

🏆 Evaluation Results

We evaluated 20 state-of-the-art MLLMs on CiteVQA using a unified prompt template. The results show that faithful evidence attribution remains substantially harder than answer-only scoring.

  • Best overall SAA: Gemini-3.1-Pro-Preview reaches 76.0 SAA with 86.1 answer score.
  • Best answer accuracy: GPT-5.4 reaches 87.1 answer score, but its SAA drops to 59.0.
  • Best open-source model: Qwen3-VL-235B-A22B reaches 22.5 SAA with 72.3 answer score.
  • Key finding: a large gap between Ans. and SAA appears across models, highlighting the benchmark's Attribution Hallucination challenge.

Full overall results:

Model                    Category                  Rec.   Rel.   Ans.   SAA
Gemini-3.1-Pro-Preview   Closed-source MLLMs       66.0   83.6   86.1   76.0
Gemini-3-Flash-Preview   Closed-source MLLMs       45.4   75.7   84.5   65.4
GPT-5.4                  Closed-source MLLMs       31.0   67.5   87.1   59.0
Gemini-2.5-Pro           Closed-source MLLMs       27.4   59.8   82.2   47.0
Seed2.0-Pro              Closed-source MLLMs       28.5   54.9   81.3   44.1
GPT-5.2                  Closed-source MLLMs       18.2   56.6   71.5   33.7
Qwen3.6-Plus             Closed-source MLLMs        7.7   25.0   85.9   17.5
GLM-5V-Turbo             Closed-source MLLMs       14.9   29.2   49.6   12.8
Qwen3-VL-235B-A22B       Open-source Large MLLMs   11.3   35.3   72.3   22.5
Gemma-4-31B              Open-source Large MLLMs   11.6   35.0   69.8   20.2
Kimi-K2.5                Open-source Large MLLMs    6.2   26.8   74.3   19.1
Qwen3.5-397B-A17B        Open-source Large MLLMs    5.4   24.6   76.5   18.3
Qwen3.5-27B              Open-source Large MLLMs    5.3   25.3   75.6   17.3
Qwen3-VL-32B             Open-source Large MLLMs    6.6   30.5   72.3   17.3
Qwen3.5-122B-A10B        Open-source Large MLLMs    3.9   19.0   73.6   14.8
Qwen3.5-9B               Open-source Small MLLMs    1.6   14.7   65.0   11.1
Qwen3.5-35B-A3B          Open-source Small MLLMs    1.7   13.7   76.4   10.7
Qwen3-VL-30B-A3B         Open-source Small MLLMs    3.5   14.6   62.2    8.2
Qwen3-VL-8B              Open-source Small MLLMs    1.0   14.7   61.2    7.5
Gemma-4-26B-A4B          Open-source Small MLLMs    3.0   17.9   48.4    6.2

📬 Contact

Because the PDF sources are fetched from external links, broken links or access restrictions may occasionally cause downloads to fail. If you encounter any download-related problems, please email wzr@stu.pku.edu.cn.

📚 Citation

@article{ma2026citevqa,
  title={CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence},
  author={Ma, Dongsheng and Li, Jiayu and Wang, Zhengren and Wang, Yijie and Kong, Jiahao and Zeng, Weijun and Xiao, Jutao and Yang, Jie and Zhang, Wentao and Wang, Bin and He, Conghui},
  journal={arXiv preprint arXiv:2605.12882},
  year={2026}
}

🙏 Acknowledgements

  • MinerU for document parsing.
  • ViDoRe V3 and other open-source datasets (SPIQA, MedQA, PubMedQA, MaintNorm, PolicyBench) for inspiring our benchmark construction.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

©️ Copyright Notice

CiteVQA is provided for academic research and non-commercial use only. We fully respect the rights of original copyright holders. If any rights holder believes that the inclusion, indexing, or use of any relevant content in this benchmark is inappropriate, please contact OpenDataLab@pjlab.org.cn. We will verify the request and remove or update the relevant content when appropriate.
