Skip to content

Eyshika/vlm_benchmark

Repository files navigation

VLM Benchmark Tool

A comprehensive benchmarking tool for Vision Language Models (VLMs) that allows you to test and compare multiple models across different hardware setups. This tool provides detailed analysis, quality scoring, and visualization capabilities for VLM performance evaluation.

Features

  • Multi-image Analysis: Process multiple images simultaneously with focused prompts
  • Model Comparison: Compare performance across different VLM models and hardware setups
  • Quality Scoring: Comprehensive scoring system based on detail coverage, quality indicators, and prompt relevance
  • Visualization: Automated generation of comparison charts and metrics
  • Hardware Efficiency: Analyze tokens/second, quality/second, and parameter efficiency
  • Flexible Configuration: Easy setup for different model endpoints and configurations

Installation

  1. Clone or download this directory
  2. Install the required dependencies:
pip install -r requirements.txt

Quick Start

Multi-Model Comparison (Primary Use Case)

The main purpose of this tool is benchmarking multiple VLM models/hardware setups:

from vlm_benchmark import benchmark_vlm

# Define your image URLs (replace with your own images)
image_urls = [
    "https://picsum.photos/800/600?random=1",
    "https://picsum.photos/800/600?random=2",
    "https://picsum.photos/800/600?random=3"
]

# Define model configurations for comparison
model_configs = [
    {
        "name": "gemma_27b_h100",
        "model_name": "gemma3-27b-it",
        "base_url": "http://server1:8000/v1/",
        "hardware": "H100"
    },
    {
        "name": "gemma_27b_amd",
        "model_name": "google/gemma-3-27b-it",
        "base_url": "http://server2:8000/v1/",
        "hardware": "AMD"
    }
]

# Run benchmark comparison (automatic visualization included)
results = benchmark_vlm(
    image_urls=image_urls,
    focus_prompt="walking person",
    model_configs=model_configs
)

Single Model Analysis (Optional)

For testing a single model setup:

# Single model configuration
single_model_config = {
    "name": "gemma_27b_test",
    "model_name": "gemma3-27b-it",
    "base_url": "http://localhost:8000/v1/",
    "hardware": "Local GPU"
}

# Run single model benchmark
results = benchmark_vlm(
    image_urls=image_urls,
    focus_prompt="walking person",
    single_model_config=single_model_config
)

Configuration

Model Configuration

Each model configuration requires:

  • name: Display name for the model (used in results and charts)
  • model_name: API model name (as expected by the server)
  • base_url: Base URL of the model server (must end with /v1/)
  • hardware: Optional hardware description for analysis

Image URLs

The tool accepts any publicly accessible image URLs. Images are automatically:

  • Downloaded using standard HTTP requests
  • Converted to RGB format if needed
  • Encoded to base64 for API transmission

Focus Prompts

Focus prompts guide the VLM to concentrate on specific aspects:

focus_prompt = "walking person"  # Look for people walking
focus_prompt = "red vehicles"    # Look for red cars/trucks
focus_prompt = "dogs playing"    # Look for dogs in playful activities

Quality Scoring System

The tool uses a comprehensive scoring algorithm:

Components

  1. Detail Coverage (×2 points each)

    • appearance, characteristics, behavior, movement, location
    • positioning, context, color, size, shape
  2. Quality Indicators (+1 point each)

    • detailed, specific, clearly, precisely, accurately
    • visible, evident, appears, shows, demonstrates
  3. Uncertainty Markers (-1 point each)

    • might, maybe, possibly, unclear, difficult to determine
    • not sure, appears to be
  4. Prompt Relevance (+1 point each)

    • Each keyword from focus prompt mentioned in response

Formula

Quality Score = (Detail Categories × 2) + Quality Terms - Uncertainty Terms + Prompt Keywords

Analysis Metrics

The tool provides several performance metrics:

  • Processing Time: Time taken to generate response
  • Response Length: Character and word count
  • Quality Score: Comprehensive quality assessment
  • Hardware Efficiency: Tokens/second and quality/second ratios
  • Detail Coverage: Number of analysis categories covered
  • Prompt Relevance: How well the response addresses the focus prompt

Output Files

JSON Results

Detailed results are saved as timestamped JSON files:

{
  "timestamp": "2024-01-15T10:30:00",
  "image_urls": [...],
  "focus_prompt": "walking person",
  "models": {
    "model_name": {
      "response": "...",
      "processing_time": 2.34,
      "error": null,
      "hardware": "H100"
    }
  },
  "analysis": {
    "metrics": {...},
    "winner": "model_name"
  }
}

Visualizations

  • vlm_comparison_latest.png: Comparison charts with metrics
  • Processing time, response length, word count, and quality score comparisons
  • Hardware efficiency analysis
  • Embedded scoring methodology explanation

Server Requirements

Your VLM servers should:

  1. Use OpenAI-compatible API format
  2. Support vision inputs with base64 encoded images
  3. Accept multimodal content arrays with text and image_url types
  4. Run on accessible network endpoints

Troubleshooting

Common Issues

  1. Image Download Failures

    • Check image URLs are publicly accessible
    • Verify network connectivity
    • Ensure images are in supported formats (JPEG, PNG, etc.)
  2. API Connection Errors

    • Verify server URLs and ports
    • Check model names match server configuration
    • Ensure servers support vision capabilities
  3. Memory Issues

    • Reduce number of images per batch
    • Use smaller image resolutions
    • Monitor server memory usage

Debug Mode

Enable debug output by checking the console for:

  • Image processing status
  • API request details
  • Response validation
  • Error messages with specific failure reasons

Customization

Adding New Metrics

Extend the calculate_response_metrics function:

def calculate_response_metrics(response, focus_prompt=""):
    # Add your custom metrics here
    custom_score = your_scoring_function(response)

    return {
        # ... existing metrics
        "custom_score": custom_score
    }

Custom Visualizations

Create additional charts by extending visualize_comparison_results:

# Add new subplot
ax_new = plt.subplot(3, 3, 9)
ax_new.bar(model_names, custom_metrics)
ax_new.set_title('Custom Metric Comparison')

License

This tool is provided as-is for VLM benchmarking and research purposes.

Support

For issues and questions:

  1. Check image URLs are accessible
  2. Verify server configurations
  3. Review error messages in console output
  4. Ensure all dependencies are installed correctly

About

A comprehensive benchmarking tool for Vision Language Models (VLMs) that allows you to test and compare multiple models across different hardware setups. This tool provides detailed analysis, quality scoring, and visualization capabilities for VLM performance evaluation.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages