Evaluate AI models' ability to generate SVG code that meets specific visual requirements, using automated rendering and LLM-judge scoring.
id
: Unique identifier for the test case
prompt
: Base textual description of what to create
requirements
: List of specific visual criteria that must be met
total_requirements
: Number of requirements, used for scoring normalization

base64
: For encoding rendered images for LLM judge evaluation
litellm
: For calling the GPT-4.1 vision model as the LLM judge
selenium
: For automated SVG-to-PNG rendering (imported conditionally)
pydantic
: For structured response validation from the LLM judge

@evaluation_test decorator to configure the comprehensive evaluation:
input_dataset
: Path to the SVG generation dataset JSONL file
completion_params
: Multiple model configurations for comparison
passed_threshold
: 50% average score required to pass the evaluation
max_concurrent_rollouts
: Limits parallel processing for resource management
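
The following is a minimal sketch of how the dataset fields, the base64 encoding step, and the 50% threshold might fit together. It uses only the Python standard library. The row contents and the helper names (`normalized_score`, `png_to_data_url`, `passes`) are hypothetical, and it assumes the judge reports a count of fulfilled requirements. It does not reproduce the actual evaluation code or the decorator's API.

```python
import base64
import json

# Hypothetical dataset row using the four fields described above.
ROW = json.dumps({
    "id": "example_001",
    "prompt": "Draw a red circle",
    "requirements": ["The circle is red", "The circle is centered"],
    "total_requirements": 2,
})

PASSED_THRESHOLD = 0.5  # mirrors the 50% average score from the configuration


def normalized_score(fulfilled: int, total_requirements: int) -> float:
    """Fraction of requirements counted as met, clamped to [0, 1]."""
    if total_requirements <= 0:
        raise ValueError("total_requirements must be positive")
    return max(0.0, min(1.0, fulfilled / total_requirements))


def png_to_data_url(png_bytes: bytes) -> str:
    """Base64-encode rendered PNG bytes as a data URL string."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def passes(scores: list[float], threshold: float = PASSED_THRESHOLD) -> bool:
    """True when the average of the scores reaches the threshold."""
    return bool(scores) and sum(scores) / len(scores) >= threshold


row = json.loads(ROW)
# Sanity check: the stored count should match the list length.
assert row["total_requirements"] == len(row["requirements"])
score = normalized_score(fulfilled=1, total_requirements=row["total_requirements"])
```

Dividing by `total_requirements` keeps scores comparable across test cases with different numbers of requirements, which is presumably why the dataset stores that count alongside the list.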