LexiSignVQA: A Unified Training-free Multi-stage Approach to Multimodal Legal Question Answering on Traffic Sign Rules

Paper | VLSP 2025 Challenge | Documentation | Dataset

Abstract

The task of multimodal legal question answering on traffic sign rules (MLQA-TSR) presents unique challenges due to the need for jointly interpreting visual and textual information in regulatory contexts. In this work, we propose LexiSignVQA, a unified, training-free, multi-stage approach developed for the VLSP 2025 MLQA-TSR shared task. Our approach integrates traffic sign detection, image embedding, and vision–language modeling with a structured preprocessing procedure that aligns traffic sign images with their corresponding legal provisions.

By combining simple yet effective image processing for clean legal databases with traffic sign detection models for real-world scenarios, our method achieves both efficiency and robustness. Experimental results on the MLQA-TSR dataset demonstrate that LexiSignVQA ranked 1st in multimodal retrieval (Subtask 1) and 7th in legal question answering (Subtask 2). Furthermore, our analysis reveals the complementary strengths of conventional segmentation versus learning-based detection and highlights the role of embeddings in addressing directional reasoning in traffic signs. These findings underscore the potential of hybrid, training-free frameworks for advancing multimodal legal reasoning and practical applications in traffic law compliance.

Task Description

The VLSP 2025 Multimodal Legal Question Answering on Traffic Sign Rules challenge consists of two interconnected subtasks:

Subtask 1: Relevant Article Retrieval

Identify relevant legal articles from Vietnamese traffic law documents given a query image and question.

Input: Query image containing traffic signs and a legal question
Output: Set of relevant legal articles from the corpus
Evaluation Metric: Precision, Recall, F2-Score

Subtask 2: Legal Question Answering

Answer multiple-choice and yes/no questions about traffic signs and regulations.

Input: Query image, question text, and (optionally) answer choices
Output: Predicted answer (A/B/C/D for multiple choice, Đúng/Sai for yes/no)
Evaluation Metric: Accuracy

Legal Document Corpus

The system processes two primary Vietnamese legal documents:

QCVN 41:2024/BGTVT - Quy chuẩn kỹ thuật quốc gia về báo hiệu đường bộ (National Technical Regulation on Road Signs)
Luật 36/2024/QH15 - Luật trật tự giao thông đường bộ (Road Traffic Order Law)

Architecture

LexiSignVQA is a unified, training-free, multi-stage framework that combines:

Traffic Sign Detection: YOLOE and GroundingDINO for detecting signs in real-world scenes
Image Embedding: SigLIP2 and CLIP models for semantic representation
Vision-Language Modeling: Gemma-3-12B for sign filtering and question answering
Vector Database: Qdrant for efficient retrieval of legal articles

The system processes both clean legal database images and real-world road scenes through distinct pipelines optimized for each context.

Methodology

Preprocessing

Legal Text: Convert HTML tables to Markdown, removing styling elements
Traffic Sign Extraction: Apply heuristic segmentation for clean LawDB images with white backgrounds
Sign Information: Use VLM to extract titles and descriptions, linking visual content to legal text
Embedding Storage: Generate image embeddings and store in vector database with law ID, article ID, and sign metadata

Subtask 1: Multimodal Retrieval

Detect traffic signs in road scene images using YOLOE
Filter detected signs using VLM based on question relevance
Embed relevant signs and retrieve top-1 matching articles from vector database
Apply rule-based post-processing to remove duplicates

Subtask 2: Question Answering

Retrieve relevant articles using Subtask 1 pipeline
Incorporate traffic sign descriptions into VLM prompt for context
Generate multiple-choice or Yes/No answers using Gemma-3-12B

Installation

Prerequisites

Python 3.10 or higher
CUDA-compatible GPU (recommended for inference speed)
Docker and Docker Compose (for Qdrant vector database)

Environment Setup

Clone the repository

git clone git@github.com:phungpx/LexiSignVQA.git
cd LexiSignVQA

Create virtual environment

# Using uv (recommended)
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Or using standard Python
python -m venv .venv
source .venv/bin/activate

Install dependencies

uv pip install -e .
# Or: pip install -e .

Start Qdrant vector database
```
docker-compose up -d
```
Qdrant dashboard will be available at http://localhost:6333/dashboard

Configure environment variables

Create a .env file in the project root:

# LLM Configuration
LLM_API_KEYS=your_api_key_here
LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
LLM_MODEL=gemma-3-12b-it

# Embedding Model
EMBEDDING_NAME=CLIP-GmP-ViT-L-14

# Detection Models
IMAGE_LAWDB_DETECTION_MODEL=ImageProcessing
IMAGE_QUERY_DETECTION_MODEL=YoloE

# Qdrant
QDRANT_URI=http://localhost:6333

# HuggingFace (optional, for gated models)
HUGGINGFACE_TOKEN=your_hf_token_here

Optional: Local LLM Server Setup

For running Gemma-3-12B locally with llama.cpp:

Build llama.cpp with CUDA support

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j -t llama-server

Download model weights

# Download from HuggingFace
mkdir -p weights
cd weights
wget https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/gemma-3-12b-it-Q4_K_M.gguf
wget https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/mmproj-BF16.gguf

Start the server

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server \
  --model weights/gemma-3-12b-it-Q4_K_M.gguf \
  --mmproj weights/mmproj-BF16.gguf \
  --gpu-layers 99 \
  --seed 3407 \
  --ctx-size 5000 \
  --port 18989 \
  --flash-attn \
  --threads 16

Update .env to use local server
```
LLM_BASE_URL=http://localhost:18989/v1
```

Usage

Quick Start: Complete Pipeline

Run the entire pipeline using the Makefile:

# Setup (first time only)
make setup

# Process legal database
make lawdb

# Run both subtasks on training data
make task1-train
make task2-train

# Or run complete pipeline at once
make pipeline-train

Step-by-Step Execution

1. Legal Database Processing

Process the legal document corpus and build the vector database:

# Step 1: Extract article information
python -m src.core.lawdb.preprocess_lawdb_infos

# Step 2: Extract traffic sign images
python -m src.core.lawdb.extract_lawdb_sign_images

# Step 3: Parse sign metadata
python -m src.core.lawdb.parse_lawdb_sign_infos

# Step 4: Ingest into vector database
python -m src.core.lawdb.ingest_lawdb_signs

Or run all steps together:

bash src/core/lawdb/run_lawdb.sh

2. Subtask 1: Article Retrieval

Run article retrieval on different datasets:

# Training set
python -m src.core.sub_task_1 --set_name train --batch_size 10

# Public test set
python -m src.core.sub_task_1 --set_name public_test --batch_size 10

# Private test set
python -m src.core.sub_task_1 --set_name private_test --batch_size 10

Arguments:

--set_name: Dataset split (train, public_test, private_test)
--batch_size: Number of signs to process per LLM call (default: 10)
--begin_idx: Starting sample index for resuming (default: 0)
--log_level: Logging verbosity (DEBUG, INFO, WARNING, ERROR)

3. Subtask 2: Answer Generation

Generate answers for the questions:

# Training set
python -m src.core.sub_task_2 --set_name train

# Public test set
python -m src.core.sub_task_2 --set_name public_test

# Private test set
python -m src.core.sub_task_2 --set_name private_test

Output will be saved to:

Training: data/train_data/vlsp_2025_train_preprocessed_*.json
Public test: data/public_test/vlsp_2025_public_test_task1_preprocessed_*.json
Private test: data/private_test/submission_task2.json

Evaluation

Evaluate performance on the training set:

# Evaluate Subtask 1 (Article Retrieval)
python -m src.eval.sub_task_1

# Evaluate Subtask 2 (Answer Generation)
python -m src.eval.sub_task_2

Interactive UI Tools

Explore and debug the system using Streamlit interfaces:

# Inspect legal database
make ui-inspect
# Or: streamlit run src/ui/inspect_lawdb.py

# Review Subtask 1 predictions
make ui-inspect-task1
# Or: streamlit run src/ui/inspect_subtask1.py

# Annotation interface
make ui-label
# Or: streamlit run src/ui/label_app.py

Experimental Results

Subtask 1: Article Retrieval

We conducted ablation studies on detection method and embedding model combinations. All experiments use Gemma-3-12B-IT for sign filtering.

Exp	LawDB Detection	Query Detection	Embedding Model	Precision	Recall	F2-Score
1	GroundingDINO	GroundingDINO	SigLIP2-SO400M	0.5430	0.5463	0.5310
2	GroundingDINO	YOLOE	SigLIP2-SO400M	0.5235	0.5733	0.5445
3	ImageProcessing	YOLOE	SigLIP2-SO400M	0.5049	0.5543	0.5259
4	ImageProcessing	YOLOE	CLIP-GmP-ViT-L	0.5416	0.5959	0.5655

Key Findings:

Hybrid detection (ImageProcessing for clean legal images + YOLOE for real-world queries) achieves best balance
CLIP-GmP outperforms SigLIP2 despite smaller model size, suggesting better generalization to traffic signs
High recall is crucial for legal applications (measured by F2-score)

Subtask 2: Question Answering

Performance on the training set (530 samples) using the best Subtask 1 configuration:

Metric	Value
Overall Accuracy	73.02% (387/530)
Multiple Choice Accuracy	73.14% (275/376)
Yes/No Accuracy	72.73% (112/154)

Per-Choice Breakdown:

Choice	Count	Correct	Accuracy
A	115	87	75.65%
B	103	81	78.64%
C	84	60	71.43%
D	74	47	63.51%
Đúng	75	58	77.33%
Sai	79	54	68.35%

Observations:

Performance degrades for later choices (C, D), possibly due to question difficulty or model bias
Yes/No questions show comparable performance to multiple choice
Room for improvement through better prompt engineering and context assembly

Project Structure

LexiSignVQA/
│
├── data/                                   # Datasets and processed outputs
│   ├── law_db/                             # Legal document database
│   │   ├── vlsp2025_law_new.json           # Original legal corpus
│   │   ├── images.fld/                     # Article images
│   │   ├── signs_imageprocessing/          # Extracted signs (image processing)
│   │   ├── signs_groundingdino/            # Extracted signs (GroundingDINO)
│   │   └── *_preprocessed*.json            # Processed database files
│   │
│   ├── train_data/                         # Training dataset
│   │   ├── vlsp_2025_train.json            # Original training data
│   │   ├── train_images/                   # Training images
│   │   ├── train_signs_yoloe/              # Detected signs (YOLOE)
│   │   └── *_preprocessed*.json            # Preprocessed training outputs
│   │
│   ├── public_test/                        # Public test dataset
│   │   ├── vlsp_2025_public_test_task1.json
│   │   └── public_test_images/
│   │
│   └── private_test/                       # Private test dataset (for submission)
│       ├── vlsp2025_submission_task1.json
│       └── private_test_images/
│
├── src/                                    # Source code
│   ├── core/                               # Core pipeline modules
│   │   ├── lawdb/                          # Legal database processing
│   │   │   ├── preprocess_lawdb_infos.py   # Extract article information
│   │   │   ├── extract_lawdb_sign_images.py# Extract sign images
│   │   │   ├── parse_lawdb_sign_infos.py   # Parse sign metadata
│   │   │   ├── ingest_lawdb_signs.py       # Ingest into vector DB
│   │   │   └── run_lawdb.sh                # Complete pipeline script
│   │   │
│   │   ├── sub_task_1.py                   # Article retrieval pipeline
│   │   ├── sub_task_2.py                   # Answer generation pipeline
│   │   ├── extract_signs.py                # Traffic sign extraction
│   │   ├── filter_signs.py                 # LLM-based sign filtering
│   │   ├── query_signs.py                  # Vector similarity search
│   │   └── utils.py                        # Utility functions
│   │
│   ├── deps/                               # External dependencies
│   │   ├── detection/                      # Detection model wrappers
│   │   │   ├── base.py                     # Base detector interface
│   │   │   ├── groundingdino.py            # GroundingDINO wrapper
│   │   │   ├── yoloe.py                    # YOLOE wrapper
│   │   │   └── image_processing.py         # Classical CV methods
│   │   │
│   │   ├── embeddings.py                   # Vision encoder wrappers
│   │   ├── llm_client.py                   # LLM API client
│   │   └── qdrant.py                       # Vector database client
│   │
│   ├── prompts/                            # Prompt templates
│   │   ├── answer_prompt.py                # Answer generation prompts
│   │   ├── parse_signs_prompt.py           # Sign parsing prompts
│   │   └── sign_filter_prompt.py           # Sign filtering prompts
│   │
│   ├── eval/                               # Evaluation scripts
│   │   ├── sub_task_1.py                   # Subtask 1 evaluation
│   │   └── sub_task_2.py                   # Subtask 2 evaluation
│   │
│   ├── ui/                                 # Interactive interfaces
│   │   ├── inspect_lawdb.py                # Legal DB inspector
│   │   ├── inspect_subtask1.py             # Subtask 1 inspector
│   │   └── label_app.py                    # Annotation tool
│   │
│   ├── settings.py                         # Configuration management
│   └── constants.py                        # Global constants
│
├── models/                                 # Pre-trained model weights
│   └── yoloe-v8l-seg.pt                    # YOLOE-Large segmentation
│
├── notebook/                               # Jupyter notebooks
│   ├── explore.ipynb                       # Data exploration
│   ├── preprocess.ipynb                    # Preprocessing experiments
│   ├── embed.ipynb                         # Embedding analysis
│   └── yoloe.ipynb                         # Detection experiments
│
├── docs/                                   # Documentation
│   ├── papers/
│   │   ├── LexiSignVQA.pdf                 # Main paper
│   │   └── LexiSignVQA_Supplemental_materials.pdf
│   ├── subtask1.png                        # Task illustrations
│   ├── subtask2.png
│   └── lawdb.png
│
├── .env.example                            # Environment template
├── .gitignore                              # Git ignore patterns
├── docker-compose.yaml                     # Docker services definition
├── Makefile                                # Build automation
├── pyproject.toml                          # Python project metadata
└── README.md                               # This file

Configuration

System configuration is managed through src/settings.py and environment variables.

Environment Variables

See .env.example for a complete template. Key variables:

# LLM Configuration
LLM_API_KEYS=key1,key2,key3              # Comma-separated API keys (for rotation)
LLM_BASE_URL=https://api.provider.com/   # API endpoint
LLM_MODEL=gemma-3-12b-it                 # Model identifier
LLM_TEMPERATURE=0.0                      # Sampling temperature (0 = deterministic)
LLM_MAX_NEW_TOKENS=5000                  # Maximum output length

# Embedding Configuration
EMBEDDING_NAME=CLIP-GmP-ViT-L-14         # Vision encoder
# Options: CLIP-GmP-ViT-L-14, siglip2-so400m-patch14-384, 
#          siglip2-base-patch16-384, dinov2-with-registers-large

# Detection Configuration
IMAGE_LAWDB_DETECTION_MODEL=ImageProcessing  # For clean legal images
IMAGE_QUERY_DETECTION_MODEL=YoloE            # For real-world query images
# Options: ImageProcessing, GroundingDINO, YoloE

# Vector Database
QDRANT_URI=http://localhost:6333
COLLECTION_NAME=auto                     # Auto-generated from other settings

# Retrieval Configuration
LIMIT=1                                  # Top-K articles to retrieve

# Optional: HuggingFace Token (for gated models)
HUGGINGFACE_TOKEN=your_token_here

Supported Models

Detection Models:

ImageProcessing: Classical CV (color/shape filtering) - best for clean images
GroundingDINO: Zero-shot object detection - balanced performance
YoloE: YOLOE-Large Segmentation - best for real-world images

Embedding Models:

CLIP-GmP-ViT-L-14: 768-dim, 427M params - recommended
siglip2-so400m-patch14-384: 1152-dim, 1.1B params
siglip2-base-patch16-384: 768-dim, 375M params
siglip2-large-patch16-384: 1024-dim, 882M params
dinov2-with-registers-large: 1024-dim - self-supervised

LLM Models:

gemma-3-12b-it: Google's instruction-tuned multimodal model - recommended
Compatible with any OpenAI-style API endpoint

Reproducibility

Hardware Requirements

All models were deployed on dual NVIDIA RTX 3060 GPUs, each with 12GB of memory, ensuring efficient inference and reliable reproducibility

See pyproject.toml for complete dependency list.

Reproducing Results

To exactly reproduce our VLSP 2025 submission results:

Use the same configuration (Experiment 4 from results table):

export LLM_MODEL=gemma-3-12b-it
export IMAGE_LAWDB_DETECTION_MODEL=ImageProcessing
export IMAGE_QUERY_DETECTION_MODEL=YoloE
export EMBEDDING_NAME=CLIP-GmP-ViT-L-14

Set deterministic seeds: The codebase uses fixed random seeds (e.g., seed=3407 for LLM server)
Run the complete pipeline:
```
make pipeline-train
```

Evaluate:

python -m src.eval.sub_task_1
python -m src.eval.sub_task_2

Data Format Specification

Input Sample (data/train_data/vlsp_2025_train.json):

{
  "id": "train_001",
  "image_id": "IMG_001",
  "relevant_articles": [
    {
      "law_id": "QCVN 41:2024/BGTVT",
      "article_id": "22"
    }
  ],
  "question_type": "Multiple choice",
  "question": "Biển báo này có ý nghĩa gì?",
  "choices": [
    "A. Cấm rẽ trái",
    "B. Cấm rẽ phải", 
    "C. Cấm quay đầu xe",
    "D. Cấm dừng và đỗ xe"
  ],
  "answer": "A"
}

Preprocessed Output (includes detection and retrieval results):

{
  "id": "train_001",
  "image_id": "IMG_001",
  "detected_signs": [
    {
      "image_name": "IMG_001_sign_0.png",
      "bbox": [100, 200, 300, 400],
      "confidence": 0.95,
      "is_chosen": true
    }
  ],
  "retrieved_articles": [
    {
      "law_id": "QCVN 41:2024/BGTVT",
      "article_id": "22",
      "score": 0.89
    }
  ],
  "predict": "A",
  "answer_explanation": "Based on the retrieved article...",
  "time_second": 2.34
}

Citation

If you use LexiSignVQA in your research, please cite:

@inproceedings{lexisignvqa2025,
  title={LexiSignVQA: A Unified Training-free Multi-stage Approach to Multimodal Legal Question Answering on Traffic Sign Rules},
  author={Phung Xuan Pham, Duc Quang Le, Tuan Hau Tran, Thinh Nguyen-Truong Huynh},
  booktitle={Proceedings of INLG},
  year={2025},
}

Dataset License

Dataset License: The VLSP 2025 MLQA-TSR dataset is provided by the VLSP organizers. Please refer to the official challenge website for dataset usage terms.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.vscode		.vscode
docs		docs
notebook		notebook
src		src
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
Makefile		Makefile
README.md		README.md
docker-compose.yaml		docker-compose.yaml
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

LexiSignVQA: A Unified Training-free Multi-stage Approach to Multimodal Legal Question Answering on Traffic Sign Rules

Abstract

Table of Contents

Task Description

Subtask 1: Relevant Article Retrieval

Subtask 2: Legal Question Answering

Legal Document Corpus

Architecture

Methodology

Preprocessing

Subtask 1: Multimodal Retrieval

Subtask 2: Question Answering

Installation

Prerequisites

Environment Setup

Optional: Local LLM Server Setup

Usage

Quick Start: Complete Pipeline

Step-by-Step Execution

1. Legal Database Processing

2. Subtask 1: Article Retrieval

3. Subtask 2: Answer Generation

Evaluation

Interactive UI Tools

Experimental Results

Subtask 1: Article Retrieval

Subtask 2: Question Answering

Project Structure

Configuration

Environment Variables

Supported Models

Reproducibility

Hardware Requirements

Reproducing Results

Data Format Specification

Citation

Dataset License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages