A comprehensive benchmarking tool for Vision Language Models (VLMs) that allows you to test and compare multiple models across different hardware setups. This tool provides detailed analysis, quality scoring, and visualization capabilities for VLM performance evaluation.
- Multi-image Analysis: Process multiple images simultaneously with focused prompts
- Model Comparison: Compare performance across different VLM models and hardware setups
- Quality Scoring: Comprehensive scoring system based on detail coverage, quality indicators, and prompt relevance
- Visualization: Automated generation of comparison charts and metrics
- Hardware Efficiency: Analyze tokens/second, quality/second, and parameter efficiency
- Flexible Configuration: Easy setup for different model endpoints and configurations
- Clone or download this directory
- Install the required dependencies:
pip install -r requirements.txtThe main purpose of this tool is benchmarking multiple VLM models/hardware setups:
from vlm_benchmark import benchmark_vlm
# Define your image URLs (replace with your own images)
image_urls = [
"https://picsum.photos/800/600?random=1",
"https://picsum.photos/800/600?random=2",
"https://picsum.photos/800/600?random=3"
]
# Define model configurations for comparison
model_configs = [
{
"name": "gemma_27b_h100",
"model_name": "gemma3-27b-it",
"base_url": "http://server1:8000/v1/",
"hardware": "H100"
},
{
"name": "gemma_27b_amd",
"model_name": "google/gemma-3-27b-it",
"base_url": "http://server2:8000/v1/",
"hardware": "AMD"
}
]
# Run benchmark comparison (automatic visualization included)
results = benchmark_vlm(
image_urls=image_urls,
focus_prompt="walking person",
model_configs=model_configs
)For testing a single model setup:
# Single model configuration
single_model_config = {
"name": "gemma_27b_test",
"model_name": "gemma3-27b-it",
"base_url": "http://localhost:8000/v1/",
"hardware": "Local GPU"
}
# Run single model benchmark
results = benchmark_vlm(
image_urls=image_urls,
focus_prompt="walking person",
single_model_config=single_model_config
)Each model configuration requires:
name: Display name for the model (used in results and charts)model_name: API model name (as expected by the server)base_url: Base URL of the model server (must end with/v1/)hardware: Optional hardware description for analysis
The tool accepts any publicly accessible image URLs. Images are automatically:
- Downloaded using standard HTTP requests
- Converted to RGB format if needed
- Encoded to base64 for API transmission
Focus prompts guide the VLM to concentrate on specific aspects:
focus_prompt = "walking person" # Look for people walking
focus_prompt = "red vehicles" # Look for red cars/trucks
focus_prompt = "dogs playing" # Look for dogs in playful activitiesThe tool uses a comprehensive scoring algorithm:
-
Detail Coverage (×2 points each)
- appearance, characteristics, behavior, movement, location
- positioning, context, color, size, shape
-
Quality Indicators (+1 point each)
- detailed, specific, clearly, precisely, accurately
- visible, evident, appears, shows, demonstrates
-
Uncertainty Markers (-1 point each)
- might, maybe, possibly, unclear, difficult to determine
- not sure, appears to be
-
Prompt Relevance (+1 point each)
- Each keyword from focus prompt mentioned in response
Quality Score = (Detail Categories × 2) + Quality Terms - Uncertainty Terms + Prompt Keywords
The tool provides several performance metrics:
- Processing Time: Time taken to generate response
- Response Length: Character and word count
- Quality Score: Comprehensive quality assessment
- Hardware Efficiency: Tokens/second and quality/second ratios
- Detail Coverage: Number of analysis categories covered
- Prompt Relevance: How well the response addresses the focus prompt
Detailed results are saved as timestamped JSON files:
{
"timestamp": "2024-01-15T10:30:00",
"image_urls": [...],
"focus_prompt": "walking person",
"models": {
"model_name": {
"response": "...",
"processing_time": 2.34,
"error": null,
"hardware": "H100"
}
},
"analysis": {
"metrics": {...},
"winner": "model_name"
}
}vlm_comparison_latest.png: Comparison charts with metrics- Processing time, response length, word count, and quality score comparisons
- Hardware efficiency analysis
- Embedded scoring methodology explanation
Your VLM servers should:
- Use OpenAI-compatible API format
- Support vision inputs with base64 encoded images
- Accept multimodal content arrays with text and image_url types
- Run on accessible network endpoints
-
Image Download Failures
- Check image URLs are publicly accessible
- Verify network connectivity
- Ensure images are in supported formats (JPEG, PNG, etc.)
-
API Connection Errors
- Verify server URLs and ports
- Check model names match server configuration
- Ensure servers support vision capabilities
-
Memory Issues
- Reduce number of images per batch
- Use smaller image resolutions
- Monitor server memory usage
Enable debug output by checking the console for:
- Image processing status
- API request details
- Response validation
- Error messages with specific failure reasons
Extend the calculate_response_metrics function:
def calculate_response_metrics(response, focus_prompt=""):
# Add your custom metrics here
custom_score = your_scoring_function(response)
return {
# ... existing metrics
"custom_score": custom_score
}Create additional charts by extending visualize_comparison_results:
# Add new subplot
ax_new = plt.subplot(3, 3, 9)
ax_new.bar(model_names, custom_metrics)
ax_new.set_title('Custom Metric Comparison')This tool is provided as-is for VLM benchmarking and research purposes.
For issues and questions:
- Check image URLs are accessible
- Verify server configurations
- Review error messages in console output
- Ensure all dependencies are installed correctly