LexiSignVQA: A Unified Training-free Multi-stage Approach to Multimodal Legal Question Answering on Traffic Sign Rules
The task of multimodal legal question answering on traffic sign rules (MLQA-TSR) presents unique challenges due to the need for jointly interpreting visual and textual information in regulatory contexts. In this work, we propose LexiSignVQA, a unified, training-free, multi-stage approach developed for the VLSP 2025 MLQA-TSR shared task. Our approach integrates traffic sign detection, image embedding, and vision–language modeling with a structured preprocessing procedure that aligns traffic sign images with their corresponding legal provisions.
By combining simple yet effective image processing for clean legal databases with traffic sign detection models for real-world scenarios, our method achieves both efficiency and robustness. Experimental results on the MLQA-TSR dataset demonstrate that LexiSignVQA ranked 1st in multimodal retrieval (Subtask 1) and 7th in legal question answering (Subtask 2). Furthermore, our analysis reveals the complementary strengths of conventional segmentation versus learning-based detection and highlights the role of embeddings in addressing directional reasoning in traffic signs. These findings underscore the potential of hybrid, training-free frameworks for advancing multimodal legal reasoning and practical applications in traffic law compliance.
- Task Description
- Installation
- Architecture
- Methodology
- Usage
- Experimental Results
- Project Structure
- Configuration
- Reproducibility
- Citation
- License
The VLSP 2025 Multimodal Legal Question Answering on Traffic Sign Rules challenge consists of two interconnected subtasks:
Identify relevant legal articles from Vietnamese traffic law documents given a query image and question.
- Input: Query image containing traffic signs and a legal question
- Output: Set of relevant legal articles from the corpus
- Evaluation Metric: Precision, Recall, F2-Score
Answer multiple-choice and yes/no questions about traffic signs and regulations.
- Input: Query image, question text, and (optionally) answer choices
- Output: Predicted answer (A/B/C/D for multiple choice, Đúng/Sai for yes/no)
- Evaluation Metric: Accuracy
The system processes two primary Vietnamese legal documents:
- QCVN 41:2024/BGTVT - Quy chuẩn kỹ thuật quốc gia về báo hiệu đường bộ (National Technical Regulation on Road Signs)
- Luật 36/2024/QH15 - Luật trật tự giao thông đường bộ (Road Traffic Order Law)
LexiSignVQA is a unified, training-free, multi-stage framework that combines:
- Traffic Sign Detection:
YOLOEandGroundingDINOfor detecting signs in real-world scenes - Image Embedding:
SigLIP2andCLIPmodels for semantic representation - Vision-Language Modeling:
Gemma-3-12Bfor sign filtering and question answering - Vector Database:
Qdrantfor efficient retrieval of legal articles
The system processes both clean legal database images and real-world road scenes through distinct pipelines optimized for each context.
- Legal Text: Convert HTML tables to Markdown, removing styling elements
- Traffic Sign Extraction: Apply heuristic segmentation for clean LawDB images with white backgrounds
- Sign Information: Use VLM to extract titles and descriptions, linking visual content to legal text
- Embedding Storage: Generate image embeddings and store in vector database with law ID, article ID, and sign metadata
- Detect traffic signs in road scene images using YOLOE
- Filter detected signs using VLM based on question relevance
- Embed relevant signs and retrieve top-1 matching articles from vector database
- Apply rule-based post-processing to remove duplicates
- Retrieve relevant articles using Subtask 1 pipeline
- Incorporate traffic sign descriptions into VLM prompt for context
- Generate multiple-choice or Yes/No answers using Gemma-3-12B
- Python 3.10 or higher
- CUDA-compatible GPU (recommended for inference speed)
- Docker and Docker Compose (for Qdrant vector database)
-
Clone the repository
git clone git@github.com:phungpx/LexiSignVQA.git cd LexiSignVQA -
Create virtual environment
# Using uv (recommended) uv venv source .venv/bin/activate # On Windows: .venv\Scripts\activate # Or using standard Python python -m venv .venv source .venv/bin/activate
-
Install dependencies
uv pip install -e . # Or: pip install -e .
-
Start Qdrant vector database
docker-compose up -d
Qdrant dashboard will be available at
http://localhost:6333/dashboard -
Configure environment variables
Create a
.envfile in the project root:# LLM Configuration LLM_API_KEYS=your_api_key_here LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/ LLM_MODEL=gemma-3-12b-it # Embedding Model EMBEDDING_NAME=CLIP-GmP-ViT-L-14 # Detection Models IMAGE_LAWDB_DETECTION_MODEL=ImageProcessing IMAGE_QUERY_DETECTION_MODEL=YoloE # Qdrant QDRANT_URI=http://localhost:6333 # HuggingFace (optional, for gated models) HUGGINGFACE_TOKEN=your_hf_token_here
For running Gemma-3-12B locally with llama.cpp:
-
Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp cd llama.cpp cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF cmake --build build --config Release -j -t llama-server -
Download model weights
# Download from HuggingFace mkdir -p weights cd weights wget https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/gemma-3-12b-it-Q4_K_M.gguf wget https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/mmproj-BF16.gguf
-
Start the server
CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server \ --model weights/gemma-3-12b-it-Q4_K_M.gguf \ --mmproj weights/mmproj-BF16.gguf \ --gpu-layers 99 \ --seed 3407 \ --ctx-size 5000 \ --port 18989 \ --flash-attn \ --threads 16
-
Update
.envto use local serverLLM_BASE_URL=http://localhost:18989/v1
Run the entire pipeline using the Makefile:
# Setup (first time only)
make setup
# Process legal database
make lawdb
# Run both subtasks on training data
make task1-train
make task2-train
# Or run complete pipeline at once
make pipeline-trainProcess the legal document corpus and build the vector database:
# Step 1: Extract article information
python -m src.core.lawdb.preprocess_lawdb_infos
# Step 2: Extract traffic sign images
python -m src.core.lawdb.extract_lawdb_sign_images
# Step 3: Parse sign metadata
python -m src.core.lawdb.parse_lawdb_sign_infos
# Step 4: Ingest into vector database
python -m src.core.lawdb.ingest_lawdb_signsOr run all steps together:
bash src/core/lawdb/run_lawdb.shRun article retrieval on different datasets:
# Training set
python -m src.core.sub_task_1 --set_name train --batch_size 10
# Public test set
python -m src.core.sub_task_1 --set_name public_test --batch_size 10
# Private test set
python -m src.core.sub_task_1 --set_name private_test --batch_size 10Arguments:
--set_name: Dataset split (train,public_test,private_test)--batch_size: Number of signs to process per LLM call (default: 10)--begin_idx: Starting sample index for resuming (default: 0)--log_level: Logging verbosity (DEBUG,INFO,WARNING,ERROR)
Generate answers for the questions:
# Training set
python -m src.core.sub_task_2 --set_name train
# Public test set
python -m src.core.sub_task_2 --set_name public_test
# Private test set
python -m src.core.sub_task_2 --set_name private_testOutput will be saved to:
- Training:
data/train_data/vlsp_2025_train_preprocessed_*.json - Public test:
data/public_test/vlsp_2025_public_test_task1_preprocessed_*.json - Private test:
data/private_test/submission_task2.json
Evaluate performance on the training set:
# Evaluate Subtask 1 (Article Retrieval)
python -m src.eval.sub_task_1
# Evaluate Subtask 2 (Answer Generation)
python -m src.eval.sub_task_2Explore and debug the system using Streamlit interfaces:
# Inspect legal database
make ui-inspect
# Or: streamlit run src/ui/inspect_lawdb.py
# Review Subtask 1 predictions
make ui-inspect-task1
# Or: streamlit run src/ui/inspect_subtask1.py
# Annotation interface
make ui-label
# Or: streamlit run src/ui/label_app.pyWe conducted ablation studies on detection method and embedding model combinations. All experiments use Gemma-3-12B-IT for sign filtering.
| Exp | LawDB Detection | Query Detection | Embedding Model | Precision | Recall | F2-Score |
|---|---|---|---|---|---|---|
| 1 | GroundingDINO | GroundingDINO | SigLIP2-SO400M | 0.5430 | 0.5463 | 0.5310 |
| 2 | GroundingDINO | YOLOE | SigLIP2-SO400M | 0.5235 | 0.5733 | 0.5445 |
| 3 | ImageProcessing | YOLOE | SigLIP2-SO400M | 0.5049 | 0.5543 | 0.5259 |
| 4 | ImageProcessing | YOLOE | CLIP-GmP-ViT-L | 0.5416 | 0.5959 | 0.5655 |
Key Findings:
- Hybrid detection (ImageProcessing for clean legal images + YOLOE for real-world queries) achieves best balance
- CLIP-GmP outperforms SigLIP2 despite smaller model size, suggesting better generalization to traffic signs
- High recall is crucial for legal applications (measured by F2-score)
Performance on the training set (530 samples) using the best Subtask 1 configuration:
| Metric | Value |
|---|---|
| Overall Accuracy | 73.02% (387/530) |
| Multiple Choice Accuracy | 73.14% (275/376) |
| Yes/No Accuracy | 72.73% (112/154) |
Per-Choice Breakdown:
| Choice | Count | Correct | Accuracy |
|---|---|---|---|
| A | 115 | 87 | 75.65% |
| B | 103 | 81 | 78.64% |
| C | 84 | 60 | 71.43% |
| D | 74 | 47 | 63.51% |
| Đúng | 75 | 58 | 77.33% |
| Sai | 79 | 54 | 68.35% |
Observations:
- Performance degrades for later choices (C, D), possibly due to question difficulty or model bias
- Yes/No questions show comparable performance to multiple choice
- Room for improvement through better prompt engineering and context assembly
LexiSignVQA/
│
├── data/ # Datasets and processed outputs
│ ├── law_db/ # Legal document database
│ │ ├── vlsp2025_law_new.json # Original legal corpus
│ │ ├── images.fld/ # Article images
│ │ ├── signs_imageprocessing/ # Extracted signs (image processing)
│ │ ├── signs_groundingdino/ # Extracted signs (GroundingDINO)
│ │ └── *_preprocessed*.json # Processed database files
│ │
│ ├── train_data/ # Training dataset
│ │ ├── vlsp_2025_train.json # Original training data
│ │ ├── train_images/ # Training images
│ │ ├── train_signs_yoloe/ # Detected signs (YOLOE)
│ │ └── *_preprocessed*.json # Preprocessed training outputs
│ │
│ ├── public_test/ # Public test dataset
│ │ ├── vlsp_2025_public_test_task1.json
│ │ └── public_test_images/
│ │
│ └── private_test/ # Private test dataset (for submission)
│ ├── vlsp2025_submission_task1.json
│ └── private_test_images/
│
├── src/ # Source code
│ ├── core/ # Core pipeline modules
│ │ ├── lawdb/ # Legal database processing
│ │ │ ├── preprocess_lawdb_infos.py # Extract article information
│ │ │ ├── extract_lawdb_sign_images.py# Extract sign images
│ │ │ ├── parse_lawdb_sign_infos.py # Parse sign metadata
│ │ │ ├── ingest_lawdb_signs.py # Ingest into vector DB
│ │ │ └── run_lawdb.sh # Complete pipeline script
│ │ │
│ │ ├── sub_task_1.py # Article retrieval pipeline
│ │ ├── sub_task_2.py # Answer generation pipeline
│ │ ├── extract_signs.py # Traffic sign extraction
│ │ ├── filter_signs.py # LLM-based sign filtering
│ │ ├── query_signs.py # Vector similarity search
│ │ └── utils.py # Utility functions
│ │
│ ├── deps/ # External dependencies
│ │ ├── detection/ # Detection model wrappers
│ │ │ ├── base.py # Base detector interface
│ │ │ ├── groundingdino.py # GroundingDINO wrapper
│ │ │ ├── yoloe.py # YOLOE wrapper
│ │ │ └── image_processing.py # Classical CV methods
│ │ │
│ │ ├── embeddings.py # Vision encoder wrappers
│ │ ├── llm_client.py # LLM API client
│ │ └── qdrant.py # Vector database client
│ │
│ ├── prompts/ # Prompt templates
│ │ ├── answer_prompt.py # Answer generation prompts
│ │ ├── parse_signs_prompt.py # Sign parsing prompts
│ │ └── sign_filter_prompt.py # Sign filtering prompts
│ │
│ ├── eval/ # Evaluation scripts
│ │ ├── sub_task_1.py # Subtask 1 evaluation
│ │ └── sub_task_2.py # Subtask 2 evaluation
│ │
│ ├── ui/ # Interactive interfaces
│ │ ├── inspect_lawdb.py # Legal DB inspector
│ │ ├── inspect_subtask1.py # Subtask 1 inspector
│ │ └── label_app.py # Annotation tool
│ │
│ ├── settings.py # Configuration management
│ └── constants.py # Global constants
│
├── models/ # Pre-trained model weights
│ └── yoloe-v8l-seg.pt # YOLOE-Large segmentation
│
├── notebook/ # Jupyter notebooks
│ ├── explore.ipynb # Data exploration
│ ├── preprocess.ipynb # Preprocessing experiments
│ ├── embed.ipynb # Embedding analysis
│ └── yoloe.ipynb # Detection experiments
│
├── docs/ # Documentation
│ ├── papers/
│ │ ├── LexiSignVQA.pdf # Main paper
│ │ └── LexiSignVQA_Supplemental_materials.pdf
│ ├── subtask1.png # Task illustrations
│ ├── subtask2.png
│ └── lawdb.png
│
├── .env.example # Environment template
├── .gitignore # Git ignore patterns
├── docker-compose.yaml # Docker services definition
├── Makefile # Build automation
├── pyproject.toml # Python project metadata
└── README.md # This file
System configuration is managed through src/settings.py and environment variables.
See .env.example for a complete template. Key variables:
# LLM Configuration
LLM_API_KEYS=key1,key2,key3 # Comma-separated API keys (for rotation)
LLM_BASE_URL=https://api.provider.com/ # API endpoint
LLM_MODEL=gemma-3-12b-it # Model identifier
LLM_TEMPERATURE=0.0 # Sampling temperature (0 = deterministic)
LLM_MAX_NEW_TOKENS=5000 # Maximum output length
# Embedding Configuration
EMBEDDING_NAME=CLIP-GmP-ViT-L-14 # Vision encoder
# Options: CLIP-GmP-ViT-L-14, siglip2-so400m-patch14-384,
# siglip2-base-patch16-384, dinov2-with-registers-large
# Detection Configuration
IMAGE_LAWDB_DETECTION_MODEL=ImageProcessing # For clean legal images
IMAGE_QUERY_DETECTION_MODEL=YoloE # For real-world query images
# Options: ImageProcessing, GroundingDINO, YoloE
# Vector Database
QDRANT_URI=http://localhost:6333
COLLECTION_NAME=auto # Auto-generated from other settings
# Retrieval Configuration
LIMIT=1 # Top-K articles to retrieve
# Optional: HuggingFace Token (for gated models)
HUGGINGFACE_TOKEN=your_token_hereDetection Models:
ImageProcessing: Classical CV (color/shape filtering) - best for clean imagesGroundingDINO: Zero-shot object detection - balanced performanceYoloE: YOLOE-Large Segmentation - best for real-world images
Embedding Models:
CLIP-GmP-ViT-L-14: 768-dim, 427M params - recommendedsiglip2-so400m-patch14-384: 1152-dim, 1.1B paramssiglip2-base-patch16-384: 768-dim, 375M paramssiglip2-large-patch16-384: 1024-dim, 882M paramsdinov2-with-registers-large: 1024-dim - self-supervised
LLM Models:
gemma-3-12b-it: Google's instruction-tuned multimodal model - recommended- Compatible with any OpenAI-style API endpoint
All models were deployed on dual NVIDIA RTX 3060 GPUs, each with 12GB of memory, ensuring efficient inference and reliable reproducibility
See pyproject.toml for complete dependency list.
To exactly reproduce our VLSP 2025 submission results:
-
Use the same configuration (Experiment 4 from results table):
export LLM_MODEL=gemma-3-12b-it export IMAGE_LAWDB_DETECTION_MODEL=ImageProcessing export IMAGE_QUERY_DETECTION_MODEL=YoloE export EMBEDDING_NAME=CLIP-GmP-ViT-L-14
-
Set deterministic seeds: The codebase uses fixed random seeds (e.g.,
seed=3407for LLM server) -
Run the complete pipeline:
make pipeline-train
-
Evaluate:
python -m src.eval.sub_task_1 python -m src.eval.sub_task_2
Input Sample (data/train_data/vlsp_2025_train.json):
{
"id": "train_001",
"image_id": "IMG_001",
"relevant_articles": [
{
"law_id": "QCVN 41:2024/BGTVT",
"article_id": "22"
}
],
"question_type": "Multiple choice",
"question": "Biển báo này có ý nghĩa gì?",
"choices": [
"A. Cấm rẽ trái",
"B. Cấm rẽ phải",
"C. Cấm quay đầu xe",
"D. Cấm dừng và đỗ xe"
],
"answer": "A"
}Preprocessed Output (includes detection and retrieval results):
{
"id": "train_001",
"image_id": "IMG_001",
"detected_signs": [
{
"image_name": "IMG_001_sign_0.png",
"bbox": [100, 200, 300, 400],
"confidence": 0.95,
"is_chosen": true
}
],
"retrieved_articles": [
{
"law_id": "QCVN 41:2024/BGTVT",
"article_id": "22",
"score": 0.89
}
],
"predict": "A",
"answer_explanation": "Based on the retrieved article...",
"time_second": 2.34
}If you use LexiSignVQA in your research, please cite:
@inproceedings{lexisignvqa2025,
title={LexiSignVQA: A Unified Training-free Multi-stage Approach to Multimodal Legal Question Answering on Traffic Sign Rules},
author={Phung Xuan Pham, Duc Quang Le, Tuan Hau Tran, Thinh Nguyen-Truong Huynh},
booktitle={Proceedings of INLG},
year={2025},
}Dataset License: The VLSP 2025 MLQA-TSR dataset is provided by the VLSP organizers. Please refer to the official challenge website for dataset usage terms.

