You can find the complete code for this example at test_svgbench.py.
Understanding SVG Generation Evaluation
SVG generation evaluation assesses a model’s ability to:- Interpret visual requirements: Understand textual descriptions of visual elements
- Generate valid SVG code: Create syntactically correct SVG markup
- Meet specific criteria: Fulfill detailed visual requirements like colors, shapes, positions
- Follow formatting conventions: Use proper SVG code block formatting
Understanding the Dataset Structure
The SVG generation dataset contains diverse test cases that evaluate different aspects of visual content creation, from simple geometric shapes to complex multi-element compositions.Dataset Format
Each entry in the dataset contains:id
: Unique identifier for the test caseprompt
: Base textual description of what to createrequirements
: List of specific visual criteria that must be mettotal_requirements
: Number of requirements for scoring normalization
Example Dataset Entry
Complex UI Recreation - Google Homepage:Dataset Characteristics
Requirement Categories:- Structural: Presence of specific shapes, elements, or text
- Aesthetic: Colors, proportions, visual balance, style consistency
- Technical: SVG formatting, dimensions, code validity
- Functional: Scalability, accessibility, professional appearance
- Automated rendering: SVG to PNG conversion using Selenium WebDriver
- LLM judge scoring: GPT-4.1 vision model evaluates requirement fulfillment
- Ratio-based scoring: Score = fulfilled_requirements / total_requirements
Step 1: Import Required Dependencies
First, we import the necessary modules for SVG evaluation:base64
: For encoding rendered images for LLM judge evaluationlitellm
: For calling the GPT-4.1 vision model as LLM judgeselenium
: For automated SVG to PNG rendering (imported conditionally)pydantic
: For structured response validation from LLM judge- Standard EP framework components for evaluation structure
Step 2: Create the Dataset Adapter
We need to convert the SVG dataset format to the EP’s expected format:- Formats visual requirements as a clear numbered list
- Provides SVG code block formatting instructions with examples
- Preserves original prompt and requirements for evaluation reference
- Creates structured metadata for scoring calculations
Step 3: Implement SVG Code Extraction
Extract SVG code from model responses with robust parsing:- Multiple parsing strategies: Handles both code blocks and raw SVG tags
- Fallback logic: Tries different extraction methods sequentially
- Robust extraction: Handles various formatting styles from different models
- Error handling: Returns None for invalid or missing SVG content
Step 4: Implement SVG to PNG Rendering
Convert SVG code to PNG images for visual evaluation:- Dimension detection: Extracts SVG dimensions from attributes or viewBox
- HTML wrapping: Creates proper HTML container with CSS styling
- Browser automation: Uses headless Chrome for consistent rendering
- Screenshot capture: Generates PNG image of rendered SVG
- Cleanup: Removes temporary files and browser instances
Step 5: Implement LLM Judge Evaluation
Use GPT-4.1 vision model to evaluate requirement fulfillment:- Vision analysis: Uses GPT-4.1’s multimodal capabilities to examine rendered images
- Structured evaluation: Provides clear requirements and expects JSON response
- Strict assessment: Instructs the judge to be thorough in requirement checking
- Response validation: Ensures proper JSON format and required fields
Step 6: Configure and Run the Evaluation
We use the@evaluation_test
decorator to configure the comprehensive evaluation:
input_dataset
: Path to SVG generation dataset JSONL filecompletion_params
: Multiple model configurations for comparisonpassed_threshold
: 50% average score required to pass evaluationmax_concurrent_rollouts
: Limits parallel processing for resource management
Evaluation Pipeline Explained
Complete Evaluation Flow
The SVG generation evaluation follows a comprehensive multi-stage pipeline:- Prompt Construction: Formats visual requirements with clear instructions
- SVG Generation: Model generates SVG code following specified format
- Code Extraction: Robust parsing extracts SVG from various response formats
- Visual Rendering: Selenium WebDriver converts SVG to PNG image
- LLM Judge Assessment: GPT-4.1 vision model evaluates requirement fulfillment
- Score Calculation: Ratio-based scoring provides normalized evaluation results