Skip to content

phungpx/LexiSignVQA

Repository files navigation

LexiSignVQA: A Unified Training-free Multi-stage Approach to Multimodal Legal Question Answering on Traffic Sign Rules


Abstract

The task of multimodal legal question answering on traffic sign rules (MLQA-TSR) presents unique challenges due to the need for jointly interpreting visual and textual information in regulatory contexts. In this work, we propose LexiSignVQA, a unified, training-free, multi-stage approach developed for the VLSP 2025 MLQA-TSR shared task. Our approach integrates traffic sign detection, image embedding, and vision–language modeling with a structured preprocessing procedure that aligns traffic sign images with their corresponding legal provisions.

By combining simple yet effective image processing for clean legal databases with traffic sign detection models for real-world scenarios, our method achieves both efficiency and robustness. Experimental results on the MLQA-TSR dataset demonstrate that LexiSignVQA ranked 1st in multimodal retrieval (Subtask 1) and 7th in legal question answering (Subtask 2). Furthermore, our analysis reveals the complementary strengths of conventional segmentation versus learning-based detection and highlights the role of embeddings in addressing directional reasoning in traffic signs. These findings underscore the potential of hybrid, training-free frameworks for advancing multimodal legal reasoning and practical applications in traffic law compliance.


Table of Contents


Task Description

The VLSP 2025 Multimodal Legal Question Answering on Traffic Sign Rules challenge consists of two interconnected subtasks:

Subtask 1: Relevant Article Retrieval

Identify relevant legal articles from Vietnamese traffic law documents given a query image and question.

  • Input: Query image containing traffic signs and a legal question
  • Output: Set of relevant legal articles from the corpus
  • Evaluation Metric: Precision, Recall, F2-Score

Subtask 2: Legal Question Answering

Answer multiple-choice and yes/no questions about traffic signs and regulations.

  • Input: Query image, question text, and (optionally) answer choices
  • Output: Predicted answer (A/B/C/D for multiple choice, Đúng/Sai for yes/no)
  • Evaluation Metric: Accuracy

Legal Document Corpus

The system processes two primary Vietnamese legal documents:

  1. QCVN 41:2024/BGTVT - Quy chuẩn kỹ thuật quốc gia về báo hiệu đường bộ (National Technical Regulation on Road Signs)
  2. Luật 36/2024/QH15 - Luật trật tự giao thông đường bộ (Road Traffic Order Law)

Architecture

LexiSignVQA is a unified, training-free, multi-stage framework that combines:

  • Traffic Sign Detection: YOLOE and GroundingDINO for detecting signs in real-world scenes
  • Image Embedding: SigLIP2 and CLIP models for semantic representation
  • Vision-Language Modeling: Gemma-3-12B for sign filtering and question answering
  • Vector Database: Qdrant for efficient retrieval of legal articles

The system processes both clean legal database images and real-world road scenes through distinct pipelines optimized for each context.


Methodology

Preprocessing

  • Legal Text: Convert HTML tables to Markdown, removing styling elements
  • Traffic Sign Extraction: Apply heuristic segmentation for clean LawDB images with white backgrounds
  • Sign Information: Use VLM to extract titles and descriptions, linking visual content to legal text
  • Embedding Storage: Generate image embeddings and store in vector database with law ID, article ID, and sign metadata

Legal Database Structure

Subtask 1: Multimodal Retrieval

  • Detect traffic signs in road scene images using YOLOE
  • Filter detected signs using VLM based on question relevance
  • Embed relevant signs and retrieve top-1 matching articles from vector database
  • Apply rule-based post-processing to remove duplicates

Subtask 2: Question Answering

  • Retrieve relevant articles using Subtask 1 pipeline
  • Incorporate traffic sign descriptions into VLM prompt for context
  • Generate multiple-choice or Yes/No answers using Gemma-3-12B

Subtask 1 Illustration

Installation

Prerequisites

  • Python 3.10 or higher
  • CUDA-compatible GPU (recommended for inference speed)
  • Docker and Docker Compose (for Qdrant vector database)

Environment Setup

  1. Clone the repository

    git clone git@github.com:phungpx/LexiSignVQA.git
    cd LexiSignVQA
  2. Create virtual environment

    # Using uv (recommended)
    uv venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
    # Or using standard Python
    python -m venv .venv
    source .venv/bin/activate
  3. Install dependencies

    uv pip install -e .
    # Or: pip install -e .
  4. Start Qdrant vector database

    docker-compose up -d

    Qdrant dashboard will be available at http://localhost:6333/dashboard

  5. Configure environment variables

    Create a .env file in the project root:

    # LLM Configuration
    LLM_API_KEYS=your_api_key_here
    LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
    LLM_MODEL=gemma-3-12b-it
    
    # Embedding Model
    EMBEDDING_NAME=CLIP-GmP-ViT-L-14
    
    # Detection Models
    IMAGE_LAWDB_DETECTION_MODEL=ImageProcessing
    IMAGE_QUERY_DETECTION_MODEL=YoloE
    
    # Qdrant
    QDRANT_URI=http://localhost:6333
    
    # HuggingFace (optional, for gated models)
    HUGGINGFACE_TOKEN=your_hf_token_here

Optional: Local LLM Server Setup

For running Gemma-3-12B locally with llama.cpp:

  1. Build llama.cpp with CUDA support

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
    cmake --build build --config Release -j -t llama-server
  2. Download model weights

    # Download from HuggingFace
    mkdir -p weights
    cd weights
    wget https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/gemma-3-12b-it-Q4_K_M.gguf
    wget https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf/resolve/main/mmproj-BF16.gguf
  3. Start the server

    CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server \
      --model weights/gemma-3-12b-it-Q4_K_M.gguf \
      --mmproj weights/mmproj-BF16.gguf \
      --gpu-layers 99 \
      --seed 3407 \
      --ctx-size 5000 \
      --port 18989 \
      --flash-attn \
      --threads 16
  4. Update .env to use local server

    LLM_BASE_URL=http://localhost:18989/v1

Usage

Quick Start: Complete Pipeline

Run the entire pipeline using the Makefile:

# Setup (first time only)
make setup

# Process legal database
make lawdb

# Run both subtasks on training data
make task1-train
make task2-train

# Or run complete pipeline at once
make pipeline-train

Step-by-Step Execution

1. Legal Database Processing

Process the legal document corpus and build the vector database:

# Step 1: Extract article information
python -m src.core.lawdb.preprocess_lawdb_infos

# Step 2: Extract traffic sign images
python -m src.core.lawdb.extract_lawdb_sign_images

# Step 3: Parse sign metadata
python -m src.core.lawdb.parse_lawdb_sign_infos

# Step 4: Ingest into vector database
python -m src.core.lawdb.ingest_lawdb_signs

Or run all steps together:

bash src/core/lawdb/run_lawdb.sh

2. Subtask 1: Article Retrieval

Run article retrieval on different datasets:

# Training set
python -m src.core.sub_task_1 --set_name train --batch_size 10

# Public test set
python -m src.core.sub_task_1 --set_name public_test --batch_size 10

# Private test set
python -m src.core.sub_task_1 --set_name private_test --batch_size 10

Arguments:

  • --set_name: Dataset split (train, public_test, private_test)
  • --batch_size: Number of signs to process per LLM call (default: 10)
  • --begin_idx: Starting sample index for resuming (default: 0)
  • --log_level: Logging verbosity (DEBUG, INFO, WARNING, ERROR)

3. Subtask 2: Answer Generation

Generate answers for the questions:

# Training set
python -m src.core.sub_task_2 --set_name train

# Public test set
python -m src.core.sub_task_2 --set_name public_test

# Private test set
python -m src.core.sub_task_2 --set_name private_test

Output will be saved to:

  • Training: data/train_data/vlsp_2025_train_preprocessed_*.json
  • Public test: data/public_test/vlsp_2025_public_test_task1_preprocessed_*.json
  • Private test: data/private_test/submission_task2.json

Evaluation

Evaluate performance on the training set:

# Evaluate Subtask 1 (Article Retrieval)
python -m src.eval.sub_task_1

# Evaluate Subtask 2 (Answer Generation)
python -m src.eval.sub_task_2

Interactive UI Tools

Explore and debug the system using Streamlit interfaces:

# Inspect legal database
make ui-inspect
# Or: streamlit run src/ui/inspect_lawdb.py

# Review Subtask 1 predictions
make ui-inspect-task1
# Or: streamlit run src/ui/inspect_subtask1.py

# Annotation interface
make ui-label
# Or: streamlit run src/ui/label_app.py

Experimental Results

Subtask 1: Article Retrieval

We conducted ablation studies on detection method and embedding model combinations. All experiments use Gemma-3-12B-IT for sign filtering.

Exp LawDB Detection Query Detection Embedding Model Precision Recall F2-Score
1 GroundingDINO GroundingDINO SigLIP2-SO400M 0.5430 0.5463 0.5310
2 GroundingDINO YOLOE SigLIP2-SO400M 0.5235 0.5733 0.5445
3 ImageProcessing YOLOE SigLIP2-SO400M 0.5049 0.5543 0.5259
4 ImageProcessing YOLOE CLIP-GmP-ViT-L 0.5416 0.5959 0.5655

Key Findings:

  • Hybrid detection (ImageProcessing for clean legal images + YOLOE for real-world queries) achieves best balance
  • CLIP-GmP outperforms SigLIP2 despite smaller model size, suggesting better generalization to traffic signs
  • High recall is crucial for legal applications (measured by F2-score)

Subtask 2: Question Answering

Performance on the training set (530 samples) using the best Subtask 1 configuration:

Metric Value
Overall Accuracy 73.02% (387/530)
Multiple Choice Accuracy 73.14% (275/376)
Yes/No Accuracy 72.73% (112/154)

Per-Choice Breakdown:

Choice Count Correct Accuracy
A 115 87 75.65%
B 103 81 78.64%
C 84 60 71.43%
D 74 47 63.51%
Đúng 75 58 77.33%
Sai 79 54 68.35%

Observations:

  • Performance degrades for later choices (C, D), possibly due to question difficulty or model bias
  • Yes/No questions show comparable performance to multiple choice
  • Room for improvement through better prompt engineering and context assembly

Project Structure

LexiSignVQA/
│
├── data/                                   # Datasets and processed outputs
│   ├── law_db/                             # Legal document database
│   │   ├── vlsp2025_law_new.json           # Original legal corpus
│   │   ├── images.fld/                     # Article images
│   │   ├── signs_imageprocessing/          # Extracted signs (image processing)
│   │   ├── signs_groundingdino/            # Extracted signs (GroundingDINO)
│   │   └── *_preprocessed*.json            # Processed database files
│   │
│   ├── train_data/                         # Training dataset
│   │   ├── vlsp_2025_train.json            # Original training data
│   │   ├── train_images/                   # Training images
│   │   ├── train_signs_yoloe/              # Detected signs (YOLOE)
│   │   └── *_preprocessed*.json            # Preprocessed training outputs
│   │
│   ├── public_test/                        # Public test dataset
│   │   ├── vlsp_2025_public_test_task1.json
│   │   └── public_test_images/
│   │
│   └── private_test/                       # Private test dataset (for submission)
│       ├── vlsp2025_submission_task1.json
│       └── private_test_images/
│
├── src/                                    # Source code
│   ├── core/                               # Core pipeline modules
│   │   ├── lawdb/                          # Legal database processing
│   │   │   ├── preprocess_lawdb_infos.py   # Extract article information
│   │   │   ├── extract_lawdb_sign_images.py# Extract sign images
│   │   │   ├── parse_lawdb_sign_infos.py   # Parse sign metadata
│   │   │   ├── ingest_lawdb_signs.py       # Ingest into vector DB
│   │   │   └── run_lawdb.sh                # Complete pipeline script
│   │   │
│   │   ├── sub_task_1.py                   # Article retrieval pipeline
│   │   ├── sub_task_2.py                   # Answer generation pipeline
│   │   ├── extract_signs.py                # Traffic sign extraction
│   │   ├── filter_signs.py                 # LLM-based sign filtering
│   │   ├── query_signs.py                  # Vector similarity search
│   │   └── utils.py                        # Utility functions
│   │
│   ├── deps/                               # External dependencies
│   │   ├── detection/                      # Detection model wrappers
│   │   │   ├── base.py                     # Base detector interface
│   │   │   ├── groundingdino.py            # GroundingDINO wrapper
│   │   │   ├── yoloe.py                    # YOLOE wrapper
│   │   │   └── image_processing.py         # Classical CV methods
│   │   │
│   │   ├── embeddings.py                   # Vision encoder wrappers
│   │   ├── llm_client.py                   # LLM API client
│   │   └── qdrant.py                       # Vector database client
│   │
│   ├── prompts/                            # Prompt templates
│   │   ├── answer_prompt.py                # Answer generation prompts
│   │   ├── parse_signs_prompt.py           # Sign parsing prompts
│   │   └── sign_filter_prompt.py           # Sign filtering prompts
│   │
│   ├── eval/                               # Evaluation scripts
│   │   ├── sub_task_1.py                   # Subtask 1 evaluation
│   │   └── sub_task_2.py                   # Subtask 2 evaluation
│   │
│   ├── ui/                                 # Interactive interfaces
│   │   ├── inspect_lawdb.py                # Legal DB inspector
│   │   ├── inspect_subtask1.py             # Subtask 1 inspector
│   │   └── label_app.py                    # Annotation tool
│   │
│   ├── settings.py                         # Configuration management
│   └── constants.py                        # Global constants
│
├── models/                                 # Pre-trained model weights
│   └── yoloe-v8l-seg.pt                    # YOLOE-Large segmentation
│
├── notebook/                               # Jupyter notebooks
│   ├── explore.ipynb                       # Data exploration
│   ├── preprocess.ipynb                    # Preprocessing experiments
│   ├── embed.ipynb                         # Embedding analysis
│   └── yoloe.ipynb                         # Detection experiments
│
├── docs/                                   # Documentation
│   ├── papers/
│   │   ├── LexiSignVQA.pdf                 # Main paper
│   │   └── LexiSignVQA_Supplemental_materials.pdf
│   ├── subtask1.png                        # Task illustrations
│   ├── subtask2.png
│   └── lawdb.png
│
├── .env.example                            # Environment template
├── .gitignore                              # Git ignore patterns
├── docker-compose.yaml                     # Docker services definition
├── Makefile                                # Build automation
├── pyproject.toml                          # Python project metadata
└── README.md                               # This file

Configuration

System configuration is managed through src/settings.py and environment variables.

Environment Variables

See .env.example for a complete template. Key variables:

# LLM Configuration
LLM_API_KEYS=key1,key2,key3              # Comma-separated API keys (for rotation)
LLM_BASE_URL=https://api.provider.com/   # API endpoint
LLM_MODEL=gemma-3-12b-it                 # Model identifier
LLM_TEMPERATURE=0.0                      # Sampling temperature (0 = deterministic)
LLM_MAX_NEW_TOKENS=5000                  # Maximum output length

# Embedding Configuration
EMBEDDING_NAME=CLIP-GmP-ViT-L-14         # Vision encoder
# Options: CLIP-GmP-ViT-L-14, siglip2-so400m-patch14-384, 
#          siglip2-base-patch16-384, dinov2-with-registers-large

# Detection Configuration
IMAGE_LAWDB_DETECTION_MODEL=ImageProcessing  # For clean legal images
IMAGE_QUERY_DETECTION_MODEL=YoloE            # For real-world query images
# Options: ImageProcessing, GroundingDINO, YoloE

# Vector Database
QDRANT_URI=http://localhost:6333
COLLECTION_NAME=auto                     # Auto-generated from other settings

# Retrieval Configuration
LIMIT=1                                  # Top-K articles to retrieve

# Optional: HuggingFace Token (for gated models)
HUGGINGFACE_TOKEN=your_token_here

Supported Models

Detection Models:

  • ImageProcessing: Classical CV (color/shape filtering) - best for clean images
  • GroundingDINO: Zero-shot object detection - balanced performance
  • YoloE: YOLOE-Large Segmentation - best for real-world images

Embedding Models:

  • CLIP-GmP-ViT-L-14: 768-dim, 427M params - recommended
  • siglip2-so400m-patch14-384: 1152-dim, 1.1B params
  • siglip2-base-patch16-384: 768-dim, 375M params
  • siglip2-large-patch16-384: 1024-dim, 882M params
  • dinov2-with-registers-large: 1024-dim - self-supervised

LLM Models:

  • gemma-3-12b-it: Google's instruction-tuned multimodal model - recommended
  • Compatible with any OpenAI-style API endpoint

Reproducibility

Hardware Requirements

All models were deployed on dual NVIDIA RTX 3060 GPUs, each with 12GB of memory, ensuring efficient inference and reliable reproducibility

See pyproject.toml for complete dependency list.

Reproducing Results

To exactly reproduce our VLSP 2025 submission results:

  1. Use the same configuration (Experiment 4 from results table):

    export LLM_MODEL=gemma-3-12b-it
    export IMAGE_LAWDB_DETECTION_MODEL=ImageProcessing
    export IMAGE_QUERY_DETECTION_MODEL=YoloE
    export EMBEDDING_NAME=CLIP-GmP-ViT-L-14
  2. Set deterministic seeds: The codebase uses fixed random seeds (e.g., seed=3407 for LLM server)

  3. Run the complete pipeline:

    make pipeline-train
  4. Evaluate:

    python -m src.eval.sub_task_1
    python -m src.eval.sub_task_2

Data Format Specification

Input Sample (data/train_data/vlsp_2025_train.json):

{
  "id": "train_001",
  "image_id": "IMG_001",
  "relevant_articles": [
    {
      "law_id": "QCVN 41:2024/BGTVT",
      "article_id": "22"
    }
  ],
  "question_type": "Multiple choice",
  "question": "Biển báo này có ý nghĩa gì?",
  "choices": [
    "A. Cấm rẽ trái",
    "B. Cấm rẽ phải", 
    "C. Cấm quay đầu xe",
    "D. Cấm dừng và đỗ xe"
  ],
  "answer": "A"
}

Preprocessed Output (includes detection and retrieval results):

{
  "id": "train_001",
  "image_id": "IMG_001",
  "detected_signs": [
    {
      "image_name": "IMG_001_sign_0.png",
      "bbox": [100, 200, 300, 400],
      "confidence": 0.95,
      "is_chosen": true
    }
  ],
  "retrieved_articles": [
    {
      "law_id": "QCVN 41:2024/BGTVT",
      "article_id": "22",
      "score": 0.89
    }
  ],
  "predict": "A",
  "answer_explanation": "Based on the retrieved article...",
  "time_second": 2.34
}

Citation

If you use LexiSignVQA in your research, please cite:

@inproceedings{lexisignvqa2025,
  title={LexiSignVQA: A Unified Training-free Multi-stage Approach to Multimodal Legal Question Answering on Traffic Sign Rules},
  author={Phung Xuan Pham, Duc Quang Le, Tuan Hau Tran, Thinh Nguyen-Truong Huynh},
  booktitle={Proceedings of INLG},
  year={2025},
}

Dataset License

Dataset License: The VLSP 2025 MLQA-TSR dataset is provided by the VLSP organizers. Please refer to the official challenge website for dataset usage terms.


About

LexiSignVQA: A Unified Training-free Multi-stage Approach to Multimodal Legal Question Answering on Traffic Sign Rules

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors