VLM Benchmark Tool

A comprehensive benchmarking tool for Vision Language Models (VLMs) that allows you to test and compare multiple models across different hardware setups. This tool provides detailed analysis, quality scoring, and visualization capabilities for VLM performance evaluation.

Features

Multi-image Analysis: Process multiple images simultaneously with focused prompts
Model Comparison: Compare performance across different VLM models and hardware setups
Quality Scoring: Comprehensive scoring system based on detail coverage, quality indicators, and prompt relevance
Visualization: Automated generation of comparison charts and metrics
Hardware Efficiency: Analyze tokens/second, quality/second, and parameter efficiency
Flexible Configuration: Easy setup for different model endpoints and configurations

Installation

Clone or download this directory
Install the required dependencies:

pip install -r requirements.txt

Quick Start

Multi-Model Comparison (Primary Use Case)

The main purpose of this tool is benchmarking multiple VLM models/hardware setups:

from vlm_benchmark import benchmark_vlm

# Define your image URLs (replace with your own images)
image_urls = [
    "https://picsum.photos/800/600?random=1",
    "https://picsum.photos/800/600?random=2",
    "https://picsum.photos/800/600?random=3"
]

# Define model configurations for comparison
model_configs = [
    {
        "name": "gemma_27b_h100",
        "model_name": "gemma3-27b-it",
        "base_url": "http://server1:8000/v1/",
        "hardware": "H100"
    },
    {
        "name": "gemma_27b_amd",
        "model_name": "google/gemma-3-27b-it",
        "base_url": "http://server2:8000/v1/",
        "hardware": "AMD"
    }
]

# Run benchmark comparison (automatic visualization included)
results = benchmark_vlm(
    image_urls=image_urls,
    focus_prompt="walking person",
    model_configs=model_configs
)

Single Model Analysis (Optional)

For testing a single model setup:

# Single model configuration
single_model_config = {
    "name": "gemma_27b_test",
    "model_name": "gemma3-27b-it",
    "base_url": "http://localhost:8000/v1/",
    "hardware": "Local GPU"
}

# Run single model benchmark
results = benchmark_vlm(
    image_urls=image_urls,
    focus_prompt="walking person",
    single_model_config=single_model_config
)

Configuration

Model Configuration

Each model configuration requires:

name: Display name for the model (used in results and charts)
model_name: API model name (as expected by the server)
base_url: Base URL of the model server (must end with /v1/)
hardware: Optional hardware description for analysis

Image URLs

The tool accepts any publicly accessible image URLs. Images are automatically:

Downloaded using standard HTTP requests
Converted to RGB format if needed
Encoded to base64 for API transmission

Focus Prompts

Focus prompts guide the VLM to concentrate on specific aspects:

focus_prompt = "walking person"  # Look for people walking
focus_prompt = "red vehicles"    # Look for red cars/trucks
focus_prompt = "dogs playing"    # Look for dogs in playful activities

Quality Scoring System

The tool uses a comprehensive scoring algorithm:

Components

Detail Coverage (×2 points each)
- appearance, characteristics, behavior, movement, location
- positioning, context, color, size, shape
Quality Indicators (+1 point each)
- detailed, specific, clearly, precisely, accurately
- visible, evident, appears, shows, demonstrates
Uncertainty Markers (-1 point each)
- might, maybe, possibly, unclear, difficult to determine
- not sure, appears to be
Prompt Relevance (+1 point each)
- Each keyword from focus prompt mentioned in response

Formula

Quality Score = (Detail Categories × 2) + Quality Terms - Uncertainty Terms + Prompt Keywords

Analysis Metrics

The tool provides several performance metrics:

Processing Time: Time taken to generate response
Response Length: Character and word count
Quality Score: Comprehensive quality assessment
Hardware Efficiency: Tokens/second and quality/second ratios
Detail Coverage: Number of analysis categories covered
Prompt Relevance: How well the response addresses the focus prompt

Output Files

JSON Results

Detailed results are saved as timestamped JSON files:

{
  "timestamp": "2024-01-15T10:30:00",
  "image_urls": [...],
  "focus_prompt": "walking person",
  "models": {
    "model_name": {
      "response": "...",
      "processing_time": 2.34,
      "error": null,
      "hardware": "H100"
    }
  },
  "analysis": {
    "metrics": {...},
    "winner": "model_name"
  }
}

Visualizations

vlm_comparison_latest.png: Comparison charts with metrics
Processing time, response length, word count, and quality score comparisons
Hardware efficiency analysis
Embedded scoring methodology explanation

Server Requirements

Your VLM servers should:

Use OpenAI-compatible API format
Support vision inputs with base64 encoded images
Accept multimodal content arrays with text and image_url types
Run on accessible network endpoints

Troubleshooting

Common Issues

Image Download Failures
- Check image URLs are publicly accessible
- Verify network connectivity
- Ensure images are in supported formats (JPEG, PNG, etc.)
API Connection Errors
- Verify server URLs and ports
- Check model names match server configuration
- Ensure servers support vision capabilities
Memory Issues
- Reduce number of images per batch
- Use smaller image resolutions
- Monitor server memory usage

Debug Mode

Enable debug output by checking the console for:

Image processing status
API request details
Response validation
Error messages with specific failure reasons

Customization

Adding New Metrics

Extend the calculate_response_metrics function:

def calculate_response_metrics(response, focus_prompt=""):
    # Add your custom metrics here
    custom_score = your_scoring_function(response)

    return {
        # ... existing metrics
        "custom_score": custom_score
    }

Custom Visualizations

Create additional charts by extending visualize_comparison_results:

# Add new subplot
ax_new = plt.subplot(3, 3, 9)
ax_new.bar(model_names, custom_metrics)
ax_new.set_title('Custom Metric Comparison')

License

This tool is provided as-is for VLM benchmarking and research purposes.

Support

For issues and questions:

Check image URLs are accessible
Verify server configurations
Review error messages in console output
Ensure all dependencies are installed correctly

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.DS_Store		.DS_Store
README.md		README.md
requirements.txt		requirements.txt
vlm_benchmark.py		vlm_benchmark.py
vlm_comparison_20250930_095131.json		vlm_comparison_20250930_095131.json
vlm_comparison_20250930_100246.json		vlm_comparison_20250930_100246.json
vlm_comparison_latest.png		vlm_comparison_latest.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VLM Benchmark Tool

Features

Installation

Quick Start

Multi-Model Comparison (Primary Use Case)

Single Model Analysis (Optional)

Configuration

Model Configuration

Image URLs

Focus Prompts

Quality Scoring System

Components

Formula

Analysis Metrics

Output Files

JSON Results

Visualizations

Server Requirements

Troubleshooting

Common Issues

Debug Mode

Customization

Adding New Metrics

Custom Visualizations

License

Support

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

VLM Benchmark Tool

Features

Installation

Quick Start

Multi-Model Comparison (Primary Use Case)

Single Model Analysis (Optional)

Configuration

Model Configuration

Image URLs

Focus Prompts

Quality Scoring System

Components

Formula

Analysis Metrics

Output Files

JSON Results

Visualizations

Server Requirements

Troubleshooting

Common Issues

Debug Mode

Customization

Adding New Metrics

Custom Visualizations

License

Support

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages