# Community
Source: https://evalprotocol.io/community
Eval Protocol is an open standard for AI evaluation that helps developers build
better AI products through robust testing and iteration.
Most AI evaluation frameworks are proprietary or organization-specific, leading to:
* Duplicated evaluation code across teams
* Inconsistent benchmarking standards
* Limited access to proven evaluation methodologies
* Slow iteration cycles without community feedback
Our protocol standardizes AI evaluation, enabling you to:
* Share and reuse evaluation logic across projects
* Benchmark against established baselines
* Iterate faster with community-driven improvements
* Build reproducible evaluation pipelines
* Access evaluation tools used by production AI systems
Join [#eval-protocol](https://discord.com/channels/1137072072808472616/1400975572405850155) on Discord to discuss implementations, share evaluation strategies, and contribute to the standard.
# AIME 2025 (Open-Resource)
Source: https://evalprotocol.io/example/aime2025
Quick AIME-style math check using boxed final answers
This example wires up a lightweight AIME-style evaluation using the open `AIME2025` JSONL from Hugging Face. It is intended for quick model picking rather than a full reimplementation of the benchmark.
This example is now implemented as a suite in `eval_protocol/benchmarks/suites/aime25.py` and exported as `aime25`.
## What it does
* Pulls AIME2025 JSONL directly from Hugging Face.
* Prompts the model to reason and place the final answer inside `\boxed{...}`.
* Parses the boxed value and compares it against ground truth for exact match scoring.
## How it’s configured
Key pieces in the SDK example:
* Dataset adapter converts raw rows with `question` and `answer` into `EvaluationRow`s.
* `@evaluation_test` provides URLs, model, and rollout parameters (including optional reasoning-effort variants).
* Evaluator extracts a final integer from the assistant message and checks equality with the ground truth.
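The suite keeps its own parser in `eval_protocol/benchmarks/suites/aime25.py`; as a rough sketch of the boxed-answer extraction and exact-match scoring described above (hypothetical helper names and a simplified regex, not the suite's code):

```python
import re
from typing import Optional

def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the model's output, if any."""
    matches = re.findall(r"\\boxed\{([^{}]+)\}", text)
    return matches[-1].strip() if matches else None

def exact_match_score(model_output: str, ground_truth: str) -> float:
    """1.0 if the boxed value equals the ground-truth answer string, else 0.0."""
    predicted = extract_boxed_answer(model_output)
    return 1.0 if predicted is not None and predicted == ground_truth.strip() else 0.0

print(exact_match_score(r"Therefore the answer is \boxed{204}.", "204"))  # 1.0
```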
## Run it locally
After installing eval-protocol, you can run the benchmark from anywhere:
```bash
pytest --pyargs eval_protocol.benchmarks.test_aime25 -v \
--ep-print-summary --ep-summary-json artifacts/aime25.json
```
Tip: use `--ep-max-rows=50` to limit dataset size, or `--ep-max-rows=all` for the full dataset. You can also use `--ep-reasoning-effort=high` and `--ep-input-param temperature=0.0` to adjust model settings.
## Notes
* This is a convenience wrapper for model selection, not a canonical reproduction of AIME.
* The evaluation is strict exact match over a parsed integer from `\boxed{...}`.
# APPS Coding Evaluation
Source: https://evalprotocol.io/example/apps-coding
Evaluate competitive programming abilities using APPS dataset with comprehensive test suites
This example demonstrates how to create comprehensive competitive programming evaluations using the APPS (Automated Programming Progress Standard) dataset from CodeParrot. The evaluation tests AI models' ability to solve complex algorithmic challenges similar to those found in competitive programming contests.
You can find the complete code for this example at [test\_apps\_coding.py](https://github.com/eval-protocol/python-sdk/blob/main/tests/pytest/test_apps_coding.py).
## Understanding APPS Coding Evaluation
APPS coding evaluation assesses a model's ability to:
* **Solve complex algorithmic problems**: Handle competitive programming challenges with multiple constraints
* **Implement sophisticated logic**: Design algorithms for graph theory, dynamic programming, and data structures
* **Handle multiple test cases**: Pass comprehensive test suites with edge cases and boundary conditions
* **Work with competitive formats**: Process standard input/output formats used in programming contests
Unlike basic coding tasks that test simple function implementation, APPS evaluation tests **advanced algorithmic thinking and competitive programming skills** - essential for building AI systems capable of complex problem-solving.
## Understanding the APPS Dataset Structure
The APPS dataset from CodeParrot contains 10,000 competitive programming problems sourced from platforms like Codeforces, AtCoder, Kattis, and Codewars, providing realistic algorithmic challenges at three difficulty levels.
### Dataset Format
Each entry in the APPS dataset contains:
* **`problem_id`**: Unique identifier for the problem
* **`question`**: Detailed problem description with constraints, examples, and input/output format
* **`solutions`**: Array of reference Python solutions that correctly solve the problem
* **`input_output`**: JSON containing comprehensive test cases with inputs and expected outputs
* **`difficulty`**: Classification as "introductory", "interview", or "competition"
* **`url`**: Source URL of the original problem from competitive programming platforms
* **`starter_code`**: Optional template code to begin implementation
### Example APPS Dataset Entry
**Competitive Programming Problem:**
```json
{
"id": 1,
"question": "Mikhail walks on a Cartesian plane. He starts at the point $(0, 0)$, and in one move he can go to any of eight adjacent points. For example, if Mikhail is currently at the point $(0, 0)$, he can go to any of the following points in one move: $(1, 0)$; $(1, 1)$; $(0, 1)$; $(-1, 1)$; $(-1, 0)$; $(-1, -1)$; $(0, -1)$; $(1, -1)$.\n\nIf Mikhail goes from the point $(x1, y1)$ to the point $(x2, y2)$ in one move, and $x1 \ne x2$ and $y1 \ne y2$, then such a move is called a diagonal move.\n\nMikhail has $q$ queries. For the $i$-th query Mikhail's target is to go to the point $(n_i, m_i)$ from the point $(0, 0)$ in exactly $k_i$ moves...",
"solutions": [
"q=int(input())\n\nfor e in range(q):\n x,y,k=list(map(int,input().split()))\n x,y=abs(x),abs(y)\n x,y=max(x,y),min(x,y)\n # ... complete solution"
],
"input_output": {
"inputs": [
"3
2 2 3
4 3 7
10 1 9"
],
"outputs": [
"1
6
-1"
]
},
"difficulty": "interview",
"url": "https://codeforces.com/problemset/problem/1036/B",
"starter_code": ""
}
```
### Dataset Characteristics
**Problem Complexity**: APPS problems feature advanced algorithmic concepts:
* **Graph algorithms**: Shortest paths, minimum spanning trees, graph traversal
* **Dynamic programming**: Optimization problems with overlapping subproblems
* **Data structures**: Advanced usage of heaps, trees, and custom data structures
* **Mathematical algorithms**: Number theory, combinatorics, and geometric problems
* **String algorithms**: Pattern matching, string manipulation, and parsing
**Difficulty Progression**:
* **Introductory (2,889 problems)**: Basic algorithmic concepts and simple implementations
* **Interview (3,592 problems)**: Common coding interview problems with moderate complexity
* **Competition (572 problems)**: Advanced competitive programming challenges
**Test Coverage**: Comprehensive testing ensures robust evaluation:
* **Multiple test cases**: Average of 21.2 test cases per problem
* **Edge cases**: Boundary conditions and corner cases included
* **Performance constraints**: Problems include time and memory limits
* **Real contest data**: Authentic test cases from actual programming competitions
**Sample Dataset**: The EP python-sdk includes a sample APPS dataset with just 3 problems for testing and demonstration purposes. The full CodeParrot APPS dataset contains 10,000 problems across all difficulty levels.
## Step 1: Import Required Dependencies
First, we import the necessary modules from the EP framework:
```python
import json
from typing import Any, Dict, List
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
from eval_protocol.rewards.apps_coding_reward import evaluate_apps_solution
```
* `json`: For parsing the complex input/output test case data
* `typing`: Python's typing module for type hints
* `EvaluateResult`, `EvaluationRow`, `Message`: Core EP data structures
* `SingleTurnRolloutProcessor`: Rollout processor for single-turn conversations
* `evaluation_test`: Decorator for configuring evaluation tests
* `evaluate_apps_solution`: Specialized function for evaluating APPS competitive programming solutions
## Step 2: Create the Dataset Adapter
We need to convert the APPS dataset format to the EP's expected format:
```python
def apps_dataset_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """
    Convert entries from APPS dataset to EvaluationRow objects.

    This adapter extracts the problem statement and stores the comprehensive
    test cases (input/output pairs) as ground truth for evaluation.

    Args:
        data: List of APPS dataset entries with problem descriptions and test cases

    Returns:
        List of EvaluationRow objects ready for evaluation
    """
    return [
        EvaluationRow(
            messages=[Message(role="user", content=row["question"])],
            ground_truth=row["input_output"],
        )
        for row in data
    ]
```
This adapter:
* Uses the complete problem description as the user message
* Stores the JSON test case data as ground truth for comprehensive evaluation
* Preserves the complex input/output format required for competitive programming
* Creates proper Message objects for the evaluation framework
**Key transformations:**
* **Problem preservation**: Maintains full problem statements with constraints and examples
* **Test case handling**: Preserves multiple test cases with complex input/output formats
* **Ground truth format**: Keeps JSON structure for sophisticated evaluation logic
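For illustration, here is how a single APPS-style row (abbreviated, with hypothetical values) flows through the adapter above, assuming the function is in scope:

```python
import json

# Abbreviated APPS-style row with only the fields the adapter reads.
sample_row = {
    "problem_id": 1,
    "question": "Given q queries, print one answer per query...",
    "input_output": json.dumps({"inputs": ["3\n2 2 3\n"], "outputs": ["1\n"]}),
}

rows = apps_dataset_to_evaluation_row([sample_row])
print(rows[0].messages[0].content)  # the problem statement becomes the user message
print(rows[0].ground_truth)         # the test-case JSON becomes the ground truth
```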
## Step 3: Configure and Run the Evaluation
We use the `@evaluation_test` decorator to configure the APPS evaluation:
```python
@evaluation_test(
    input_dataset=["tests/pytest/data/apps_sample_dataset.jsonl"],
    dataset_adapter=apps_dataset_to_evaluation_row,
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0, "max_tokens": 4096}],
    passed_threshold=0.33,
    rollout_processor=SingleTurnRolloutProcessor(),
    num_runs=1,
    mode="pointwise",
)
def test_apps_code_evaluation(row: EvaluationRow) -> EvaluationRow:
    """
    Evaluation function that tests APPS coding problems using evaluate_apps_solution.

    Args:
        row: EvaluationRow containing the conversation messages and ground_truth as JSON string

    Returns:
        EvaluationRow with the evaluation result
    """
    # Use evaluate_apps_solution directly
    result = evaluate_apps_solution(
        messages=row.messages,
        ground_truth=row.ground_truth,
    )

    # Set the evaluation result on the row
    row.evaluation_result = result
    return row
```
**Configuration parameters:**
* `input_dataset`: Path to the APPS dataset JSONL file
* `completion_params`: Model and sampling settings (a capable model with a higher token limit for complex solutions)
* `passed_threshold`: 0.33 success-rate threshold (competitive programming is challenging)
* `num_runs`: Number of evaluation runs (1 here)
* `mode`: `pointwise` for evaluating individual problems independently
* `dataset_adapter`: Function that converts APPS format to EvaluationRow objects
* `rollout_processor`: `SingleTurnRolloutProcessor()` for single-turn completions
**Evaluation process:**
1. **Problem presentation**: Present the full competitive programming problem to the model
2. **Solution generation**: Model generates a complete algorithmic solution
3. **Code extraction**: Extract Python code from the model's response
4. **Comprehensive testing**: Run solution against all test cases in the problem
5. **Pass rate calculation**: Calculate percentage of test cases passed
## Core Functions Explained
### `evaluate_apps_solution` Function
The `evaluate_apps_solution` function is a specialized evaluation function designed for competitive programming problems that handles complex test case execution and scoring.
**Key Features:**
* **Code extraction**: Identifies and extracts Python code from model responses
* **Test case parsing**: Processes JSON test case data with multiple input/output pairs
* **Secure execution**: Runs code safely with timeouts and resource limitations
* **Comprehensive scoring**: Calculates pass rates across all test cases
* **Error handling**: Provides detailed feedback on compilation and runtime errors
* **Competitive format support**: Handles standard input/output format used in contests
**Function Signature:**
```python
def evaluate_apps_solution(
    messages: List[Message],
    ground_truth: Optional[str],
    **kwargs
) -> EvaluateResult:
```
**Parameters:**
* `messages`: List of conversation messages (problem statement from user, solution from assistant)
* `ground_truth`: JSON string containing test cases with inputs and expected outputs
* `**kwargs`: Additional parameters including execution timeout settings
**Return Value:**
* `EvaluateResult` with pass rate score (0.0 to 1.0) and detailed metrics
### Implementation Details
The `evaluate_apps_solution` function implements a comprehensive evaluation pipeline with robust security and error handling:
**1. Code Extraction Process:**
````python
# Extract Python code from model response
code_solution = _extract_python_code(raw_solution_content)
# Handles various response formats:
# - Markdown code blocks: ```python ... ```
# - Inline code snippets
# - Mixed text and code responses
# - Removes verbose explanations and comments
````
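`_extract_python_code` is internal to the reward module; a simplified sketch of markdown-fence extraction (illustrative only, not the SDK's exact logic) could look like this:

````python
import re
from typing import Optional

def extract_python_code_sketch(response: str) -> Optional[str]:
    """Return the first ```python fenced block, falling back to any fenced block."""
    blocks = re.findall(r"```python\s*\n(.*?)```", response, flags=re.DOTALL)
    if not blocks:
        blocks = re.findall(r"```\s*\n(.*?)```", response, flags=re.DOTALL)
    return blocks[0].strip() if blocks else None
````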
**2. Ground Truth Processing:**
```python
# Parse JSON test case data
if isinstance(ground_truth, str):
    in_outs = json.loads(ground_truth)  # Parse JSON string
elif isinstance(ground_truth, dict):
    in_outs = ground_truth  # Already parsed by JSONL loader

# Validate required structure
assert "inputs" in in_outs and "outputs" in in_outs
```
**3. Secure Test Execution:**
The evaluation uses sandboxed execution with comprehensive security measures:
```python
# Force standard input execution path and prepare secure environment
in_outs_for_check = in_outs.copy()
if "fn_name" in in_outs_for_check:
    del in_outs_for_check["fn_name"]  # Use stdin/stdout testing

# For each test case in the problem:
for i, (test_input, expected_output) in enumerate(zip(inputs, outputs)):
    # Prepare secure execution environment
    wrapped_code = f"""
import sys
sys.setrecursionlimit(6*10**5)
{standard_imports}  # Common competitive programming imports
{user_generated_code}
"""

    # Execute in isolated subprocess with resource limits
    process = subprocess.run(
        [sys.executable, "-c", wrapped_code],
        input=test_input,
        capture_output=True,
        timeout=timeout,
        text=True,
    )

    # Compare outputs and record result
    if process.returncode == 0:
        actual_output = process.stdout.strip()
        results.append(actual_output == expected_output.strip())
    else:
        results.append(False)  # Runtime error
```
**Security Features:**
* **Sandboxed execution**: Code runs in isolated subprocess with resource limits
* **Standard I/O redirection**: Test inputs via stdin, outputs captured from stdout
* **Security restrictions**: File system access, network operations, and dangerous imports disabled
* **Resource monitoring**: Memory usage, CPU time, and execution duration tracked
* **Timeout enforcement**: Long-running or infinite loops automatically terminated
**4. Scoring and Error Analysis:**
```python
# Calculate pass rate from results
actual_results = results_list  # List of True/False for each test case
num_tests = len(actual_results)
passed_count = sum(1 for res in actual_results if res is True)
score = float(passed_count) / num_tests

# Process execution metadata for detailed error reporting
if exec_metadata_list:
    if len(exec_metadata_list) == 1 and exec_metadata_list[0].get("error"):
        # Global compilation error
        reason_msg += f" Execution Error: {exec_metadata_list[0]['error']}"
    elif score == 0.0 and exec_metadata_list[0].get("error_message") == "Wrong Answer":
        # Detailed failure analysis with specific test case details
        first_fail_meta = exec_metadata_list[0]
        reason_msg += (
            f". First fail details: Inputs: {first_fail_meta.get('inputs', 'N/A')}, "
            f"Expected: {first_fail_meta.get('expected', 'N/A')}, "
            f"Got: {first_fail_meta.get('output', 'N/A')}"
        )
```
**Error Handling Hierarchy:**
1. **Code extraction failure**: Score 0.0 - No valid Python code found
2. **Compilation errors**: Score 0.0 - Syntax errors prevent execution
3. **Runtime errors**: Per-test-case failure - Exceptions during execution
4. **Timeout errors**: Per-test-case failure - Exceeded time limits
5. **Wrong output**: Per-test-case failure - Incorrect results but valid execution
6. **Perfect execution**: Score 1.0 - All test cases pass with correct outputs
**Result Types:**
* **True**: Test case passed with correct output
* **False**: Test case failed (wrong output)
* **-1**: Runtime error or timeout
* **-2**: Compilation error
**Example Evaluation Flow:**
```python
# Problem: Mikhail's diagonal moves (from example above)
# Model generates solution
result = evaluate_apps_solution(
    messages=[Message(role="user", content=problem_description)],
    ground_truth='{"inputs": ["3\\n2 2 3\\n4 3 7\\n10 1 9\\n"], "outputs": ["1\\n6\\n-1\\n"]}',
)

# Result might be:
# EvaluateResult(
#     score=1.0,  # All test cases passed
#     reason="Passed 1/1 test cases",
#     metrics={
#         "pass_rate": MetricResult(score=1.0, reason="1/1"),
#         "execution_metadata": MetricResult(...),
#     },
# )
```
## Evaluation Scenarios and Results
The APPS coding evaluation handles various competitive programming scenarios:
### Perfect Solution (Score: 1.0)
**Scenario**: Model correctly solves all test cases
```python
# Problem: Mikhail's diagonal moves
# Model provides optimal solution using coordinate geometry
q = int(input())
for _ in range(q):
    x, y, k = list(map(int, input().split()))
    x, y = abs(x), abs(y)
    # ... correct algorithm implementation
    # Handles all coordinate movement constraints

# Result: ✅ Passed 3/3 test cases (100% success rate)
```
### Partial Solution (Score: 0.67)
**Scenario**: Model solves most test cases but fails on edge cases
```python
# Problem: Mikhail's diagonal moves
# Model has correct main logic but misses boundary condition
q = int(input())
for _ in range(q):
    x, y, k = list(map(int, input().split()))
    # ... mostly correct implementation
    # Fails on impossible movement case

# Result: ⚠️ Passed 2/3 test cases (67% success rate)
```
### Algorithmic Error (Score: 0.0)
**Scenario**: Model uses incorrect algorithm approach
```python
# Problem: Mikhail's diagonal moves
# Model uses incorrect movement calculation
q = int(input())
for _ in range(q):
    x, y, k = list(map(int, input().split()))
    # Incorrect approach - doesn't consider diagonal optimization
    print(k)  # Always outputs k regardless of constraints

# Result: ❌ Passed 0/3 test cases - Wrong algorithmic approach
```
### Timeout Error (Score: 0.0)
**Scenario**: Model solution exceeds time limits
```python
# Problem: Mikhail's diagonal moves
# Model uses inefficient brute force instead of mathematical approach
q = int(input())
for _ in range(q):
    x, y, k = list(map(int, input().split()))
    # Simulates all possible paths - exponential complexity
    # Times out on larger coordinate values

# Result: ❌ Execution timeout - Algorithm too slow for constraints
```
### Compilation Error (Score: 0.0)
**Scenario**: Model generates syntactically incorrect code
```python
# Problem: Mikhail's diagonal moves
# Model has syntax errors
q = int(input())
for _ in range(q)  # Missing colon
    x, y, k = list(map(int, input().split()))
    # ... rest of solution

# Result: ❌ Compilation error: SyntaxError - Invalid Python syntax
```
## Conclusion
This APPS coding evaluation demonstrates how to assess AI models' competitive programming capabilities using comprehensive algorithmic challenges. The evaluation ensures models can understand complex problem statements, design efficient algorithms, and implement solutions that pass rigorous test suites.
This evaluation is particularly valuable for:
* **Algorithmic reasoning assessment**: Testing advanced problem-solving capabilities
* **Competitive programming preparation**: Validating solutions against contest-quality problems
* **Algorithm implementation**: Ensuring correct and efficient code generation
The APPS evaluation focuses on **algorithmic correctness and efficiency** rather than simple function implementation, making it essential for building AI systems capable of sophisticated problem-solving. It provides comprehensive testing with real competitive programming challenges and detailed performance metrics.
# Basic Coding Evaluation
Source: https://evalprotocol.io/example/basic-coding
Evaluate code correctness by executing Python functions and comparing outputs
This example demonstrates how to create comprehensive basic coding evaluations using the Eval Protocol (EP) framework. The evaluation uses code execution functions to test whether models can write correct Python functions that produce expected outputs when executed with specific inputs.
You can find the complete code for this example at [test\_basic\_coding.py](https://github.com/eval-protocol/python-sdk/blob/main/tests/pytest/test_basic_coding.py).
## Understanding Basic Coding Evaluation
Basic coding evaluation assesses a model's ability to:
* **Write syntactically correct code**: Generate valid Python syntax without errors
* **Implement correct logic**: Create functions that perform the specified operations
* **Handle different inputs**: Process various input values correctly (positive, negative, zero, edge cases)
* **Produce exact outputs**: Return results that match expected values precisely
Unlike text-based evaluations that focus on natural language generation, coding evaluations test a model's **programming capabilities and logical reasoning** - essential skills for AI systems that need to write functional code.
## Understanding the Dataset Structure
The basic coding dataset contains simple programming tasks that evaluate fundamental coding skills, from arithmetic operations to data structure manipulation.
### Dataset Format
Each entry in the dataset contains:
* **`prompt`**: The coding task description specifying what function to write
* **`input`**: Test input value to pass to the function
* **`expected_output`**: The correct output the function should return
### Example Dataset Entries
**Simple Addition Function:**
```json
{
"prompt": "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1.",
"input": "5",
"expected_output": "6"
}
```
**Multiplication Function:**
```json
{
"prompt": "Write a Python function `multiply_by_two` that takes an integer and returns the integer multiplied by 2.",
"input": "3",
"expected_output": "6"
}
```
**List Operations:**
```json
{
"prompt": "Write a Python function `get_length` that takes a list and returns its length.",
"input": "[1, 2, 3]",
"expected_output": "3"
}
```
## Step 1: Import Required Dependencies
First, we import the necessary modules from the EP framework:
```python
from typing import Any, Dict, List
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
from eval_protocol.rewards.code_execution import extract_code_blocks, execute_python_code
```
* `typing`: Python's typing module for type hints (Any, Dict, List)
* `EvaluateResult`: Result object containing evaluation score and reasoning
* `EvaluationRow`: Data structure containing conversation messages and ground truth
* `Message`: Individual message in the conversation
* `SingleTurnRolloutProcessor`: Rollout processor for single-turn conversations
* `evaluation_test`: Decorator for configuring evaluation tests
* `extract_code_blocks`: Function to extract Python code from markdown code blocks
* `execute_python_code`: Function to safely execute Python code and capture output
## Step 2: Create the Dataset Adapter
We need to convert the basic coding dataset format to the EP's expected format:
```python
def coding_dataset_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """
    Convert entries from coding dataset to EvaluationRow objects.

    This adapter combines the coding prompt with the test input to create
    a complete user message, and stores the expected output as ground truth
    for comparison during evaluation.

    Args:
        data: List of coding dataset entries with prompt, input, and expected_output

    Returns:
        List of EvaluationRow objects ready for evaluation
    """
    return [
        EvaluationRow(
            messages=[Message(role="user", content=f"{row['prompt']} Input: {row['input']}")],
            ground_truth=row["expected_output"],
        )
        for row in data
    ]
```
This adapter:
* Combines the coding prompt with the test input into a single user message
* Stores the expected output as ground truth for comparison
* Creates Message objects with the proper role and content structure
* Returns a list of EvaluationRow objects that the framework can process
**Key transformations:**
* **Message construction**: Combines prompt and input into clear instructions
* **Ground truth preservation**: Maintains expected output for exact comparison
* **Role assignment**: Sets proper user role for the coding request
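To make the message construction concrete, here is one dataset entry passed through the adapter above (assuming it is in scope):

```python
sample = {
    "prompt": "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1.",
    "input": "5",
    "expected_output": "6",
}

rows = coding_dataset_to_evaluation_row([sample])
print(rows[0].messages[0].content)  # "... incremented by 1. Input: 5"
print(rows[0].ground_truth)         # "6"
```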
## Step 3: Configure and Run the Evaluation
We use the `@evaluation_test` decorator to configure the evaluation:
```python
@evaluation_test(
    input_dataset=["tests/pytest/data/basic_coding_dataset.jsonl"],
    dataset_adapter=coding_dataset_to_evaluation_row,
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0, "max_tokens": 4096}],
    passed_threshold=0.8,
    rollout_processor=SingleTurnRolloutProcessor(),
    num_runs=1,
    mode="pointwise",
)
async def test_coding_code_evaluation(row: EvaluationRow) -> EvaluationRow:
    """
    Evaluation function that tests code correctness by executing it locally.

    This function:
    1. Extracts Python code from the assistant's response
    2. Executes the code locally with timeout=10
    3. Compares the output to ground_truth
    4. Returns a score of 1.0 if output matches, 0.0 otherwise

    Args:
        row: EvaluationRow containing the conversation messages and expected_output in ground_truth

    Returns:
        EvaluationRow with the evaluation result
    """
    # Check if we have an assistant response
    if len(row.messages) < 2 or row.messages[-1].role != "assistant":
        row.evaluation_result = EvaluateResult(score=0.0, reason="No assistant response found")
        return row

    assistant_content = row.messages[-1].content or ""
    expected_output = (row.ground_truth or "").strip()

    # Extract Python code blocks
    code_blocks = extract_code_blocks(assistant_content, language="python")
    if not code_blocks:
        row.evaluation_result = EvaluateResult(score=0.0, reason="No Python code block found")
        return row

    code = code_blocks[0]["code"]

    # Execute the code locally
    execution_result = execute_python_code(code, timeout=10)
    if not execution_result.get("success", False):
        error_msg = execution_result.get("error", "Code execution failed")
        row.evaluation_result = EvaluateResult(score=0.0, reason=f"Execution error: {error_msg}")
        return row

    # Compare output with expected
    actual_output = (execution_result.get("output", "") or "").strip()
    if actual_output == expected_output:
        row.evaluation_result = EvaluateResult(
            score=1.0,
            reason=f"✅ Output matches: '{actual_output}'"
        )
    else:
        row.evaluation_result = EvaluateResult(
            score=0.0,
            reason=f"❌ Expected: '{expected_output}', Got: '{actual_output}'"
        )
    return row
```
**Configuration parameters:**
* `input_dataset`: Path to the basic coding dataset JSONL file
* `completion_params`: Model and sampling settings (Fireworks Kimi model, temperature=0.0 for deterministic results)
* `passed_threshold`: 80% success-rate threshold for the evaluation
* `num_runs`: Number of evaluation runs (1 here)
* `mode`: `pointwise` for evaluating individual rows independently
* `dataset_adapter`: Function that converts the coding format to EvaluationRow objects
* `rollout_processor`: `SingleTurnRolloutProcessor()` for single-turn coding completions
**Evaluation process:**
1. **Validate response**: Ensure we have a valid assistant response containing code
2. **Extract code**: Use `extract_code_blocks` to find Python code in markdown blocks
3. **Execute safely**: Run the code in a secure environment with timeout protection
4. **Compare output**: Perform exact string comparison between actual and expected results
5. **Return score**: Provide binary score (1.0 for exact match, 0.0 for any difference)
## Core Functions Explained
### `extract_code_blocks` Function
The `extract_code_blocks` function identifies and extracts Python code from the model's response, typically from markdown code blocks.
**Key Features:**
* **Markdown parsing**: Identifies fenced `` ```python `` code blocks in responses
* **Language filtering**: Can filter for specific programming languages
* **Content cleaning**: Removes verbose explanatory text that might interfere with execution
* **Multiple blocks**: Can extract multiple code blocks if present
**Function Signature:**
```python
def extract_code_blocks(text: str, language: Optional[str] = None) -> List[Dict[str, str]]:
```
**Parameters:**
* `text`: The assistant's response containing code
* `language`: Optional language filter (e.g., "python")
**Return Value:**
* List of dictionaries with "code" and "language" keys
**Example Usage:**
````python
response = """
Here's the solution:

```python
def add_one(x):
    return x + 1
```

This function takes an integer and returns it incremented by 1.
"""

code_blocks = extract_code_blocks(response, language="python")
print(code_blocks[0]["code"])  # "def add_one(x):\n    return x + 1"
````
### `execute_python_code` Function
The `execute_python_code` function safely executes Python code in a controlled environment with security restrictions and resource limits.
**Key Features:**
* **Secure execution**: Runs code in a subprocess with memory and time limits
* **Safety guards**: Disables dangerous operations like file system access
* **Timeout protection**: Prevents infinite loops and long-running code
* **Error handling**: Captures and reports execution errors clearly
* **Output capture**: Returns both stdout and stderr from execution
**Function Signature:**
```python
def execute_python_code(code: str, timeout: int = 5) -> Dict[str, Any]:
```
**Parameters:**
* `code`: Python code to execute
* `timeout`: Maximum execution time in seconds
**Return Value:**
* Dictionary with execution results including success status, output, and errors
**Example Usage:**
```python
code = """
def add_one(x):
return x + 1
result = add_one(5)
print(result)
"""
result = execute_python_code(code, timeout=10)
if result["success"]:
print(f"Output: {result['output']}") # "Output: 6"
else:
print(f"Error: {result['error']}")
```
### Security and Safety Features
The code execution environment includes several safety measures:
**Resource Limits:**
* **Memory limits**: Restricts memory usage to prevent excessive consumption
* **CPU limits**: Prevents long-running computations
* **Timeout enforcement**: Kills processes that exceed time limits
**Disabled Operations:**
* **File system access**: Prevents reading/writing files
* **Network operations**: Blocks network requests
* **System calls**: Disables potentially dangerous system operations
* **Process spawning**: Prevents creating new processes
**Error Handling:**
* **Exception capture**: Catches and reports Python exceptions
* **Timeout detection**: Identifies and reports timeout errors
* **Resource exhaustion**: Handles memory and CPU limit violations
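`execute_python_code` implements these guards internally; as a much-simplified sketch of subprocess execution with a timeout (illustrative only, without the memory limits and import restrictions listed above):

```python
import subprocess
import sys
from typing import Any, Dict

def run_python_snippet(code: str, timeout: int = 5) -> Dict[str, Any]:
    """Run code in a separate Python process, capturing stdout/stderr with a timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"success": False, "output": "", "error": f"Timed out after {timeout}s"}
    return {"success": proc.returncode == 0, "output": proc.stdout, "error": proc.stderr}

print(run_python_snippet("print(1 + 1)"))  # {'success': True, 'output': '2\n', 'error': ''}
```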
## Evaluation Scenarios and Results
The basic coding evaluation handles various scenarios with different outcomes:
### Perfect Implementation (Score: 1.0)
**Scenario**: Model writes correct function that produces expected output
```python
# User prompt: "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1. Input: 5"
# Model response:
def add_one(x):
    return x + 1

result = add_one(5)
print(result)
```
**Result**: ✅ Output matches: '6' - Function correctly implements the required logic
### Syntax Error (Score: 0.0)
**Scenario**: Model writes code with syntax errors
```python
# User prompt: "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1. Input: 5"
# Model response:
def add_one(x)  # Missing colon
    return x + 1

result = add_one(5)
print(result)
```
**Result**: ❌ Execution error: SyntaxError - Invalid Python syntax prevents execution
### Logic Error (Score: 0.0)
**Scenario**: Model writes syntactically correct but logically incorrect code
```python
# User prompt: "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1. Input: 5"
# Model response:
def add_one(x):
    return x + 2  # Wrong logic: adds 2 instead of 1

result = add_one(5)
print(result)
```
**Result**: ❌ Expected: '6', Got: '7' - Logic error produces wrong output
### Missing Function Call (Score: 0.0)
**Scenario**: Model defines function but doesn't call it with the input
```python
# User prompt: "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1. Input: 5"
# Model response:
def add_one(x):
    return x + 1

# Missing: result = add_one(5)
# Missing: print(result)
```
**Result**: ❌ Expected: '6', Got: '' - No output produced
### Runtime Error (Score: 0.0)
**Scenario**: Model writes code that fails during execution
```python
# User prompt: "Write a Python function `get_length` that takes a list and returns its length. Input: [1, 2, 3]"
# Model response:
def get_length(lst):
    return lst.length()  # Wrong method: should use len()

result = get_length([1, 2, 3])
print(result)
```
**Result**: ❌ Execution error: AttributeError - Runtime error during function call
### Edge Case Handling (Score: 1.0)
**Scenario**: Model correctly handles edge cases like empty lists or zero values
```python
# User prompt: "Write a Python function `get_length` that takes a list and returns its length. Input: []"
# Model response:
def get_length(lst):
    return len(lst)

result = get_length([])
print(result)
```
**Result**: ✅ Output matches: '0' - Correctly handles empty list edge case
## Conclusion
This basic coding evaluation demonstrates how to assess AI models' programming capabilities using code execution and output comparison. The evaluation ensures models can write syntactically correct code, implement proper logic, handle various inputs, and produce exact expected outputs.
This evaluation is particularly valuable for:
* **AI model assessment**: Evaluating language models' programming capabilities
* **Code generation tools**: Validating the correctness of automatically generated code
* **Algorithm testing**: Ensuring implementations produce correct results
The basic coding evaluation focuses on **functional correctness** rather than code style or efficiency, making it essential for building reliable AI systems that can write working code. It provides objective scoring with secure execution, immediate feedback, and scalable automated testing.
# Function Calling Evaluation
Source: https://evalprotocol.io/example/function-calling
Evaluate function calling accuracy with exact tool match comparison
This example demonstrates how to create comprehensive function calling evaluations using the Eval Protocol (EP) framework. The evaluation uses the `exact_tool_match_reward` function to assess whether models correctly call the right functions with the correct arguments in the expected format.
You can find the complete code for this example at [test\_pytest\_function\_calling.py](https://github.com/eval-protocol/python-sdk/blob/main/tests/pytest/test_pytest_function_calling.py).
## Understanding Function Calling Evaluation
Function calling evaluation assesses a model's ability to:
* **Identify when to use tools**: Determine if a user query requires function execution
* **Select the correct function**: Choose the appropriate tool from available options
* **Provide accurate arguments**: Pass the right parameters with correct values
* **Follow proper formatting**: Use the expected tool call structure
Unlike text-based evaluations that focus on content generation, function calling evaluations test a model's **tool selection and parameterization capabilities** - critical skills for AI agents that interact with external systems.
## Understanding the Dataset Structure
The function calling dataset contains diverse test cases that evaluate different aspects of tool usage, from simple weather queries to complex nested object creation.
### Dataset Format
Each entry in the dataset contains:
* **`messages`**: Conversation history with user queries and assistant responses
* **`tools`**: Available function definitions with schemas
* **`ground_truth`**: Expected tool calls in JSON format
* **`evaluation_result`**: Pre-computed evaluation scores for validation
* **`input_metadata`**: Additional context including task type and difficulty
### Example Dataset Entries
**Perfect Match - Weather Query:**
```json
{
"messages": [
{"role": "user", "content": "What's the weather in London?"},
{
"role": "assistant",
"tool_calls": [
{
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"
}
}
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather information for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "The city name"},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location", "unit"]
}
}
}
],
"ground_truth": "{\"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": \"{\\\"location\\\": \\\"London\\\", \\\"unit\\\": \\\"celsius\\\"}\"}}]}",
"input_metadata": {
"row_id": "weather_london_perfect",
"dataset_info": {"task_type": "function_calling", "difficulty": "easy"}
}
}
```
**Argument Mismatch - Wrong Unit:**
```json
{
"messages": [
{"role": "user", "content": "What's the weather in London?"},
{
"role": "assistant",
"tool_calls": [
{
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"London\", \"unit\": \"fahrenheit\"}"
}
}
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather information for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "The city name"},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location", "unit"]
}
}
}
],
"ground_truth": "{\"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": \"{\\\"location\\\": \\\"London\\\", \\\"unit\\\": \\\"celsius\\\"}\"}}]}",
"input_metadata": {
"row_id": "weather_london_unit_mismatch",
"dataset_info": {"task_type": "function_calling", "difficulty": "easy"}
}
}
```
**Function Name Mismatch:**
```json
{
"messages": [
{"role": "user", "content": "What's the weather in London?"},
{
"role": "assistant",
"tool_calls": [
{
"type": "function",
"function": {
"name": "fetch_weather",
"arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"
}
}
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather information for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "The city name"},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location", "unit"]
}
}
}
],
"ground_truth": "{\"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": \"{\\\"location\\\": \\\"London\\\", \\\"unit\\\": \\\"celsius\\\"}\"}}]}",
"input_metadata": {
"row_id": "weather_london_name_mismatch",
"dataset_info": {"task_type": "function_calling", "difficulty": "easy"}
}
}
```
**No Tool Call Expected:**
```json
{
"messages": [
{"role": "user", "content": "Tell me a joke."},
{"role": "assistant", "content": "Why did the chicken cross the road?"}
],
"tools": [],
"ground_truth": "{\"tool_calls\": []}",
"input_metadata": {
"row_id": "joke_no_calls",
"dataset_info": {"task_type": "function_calling", "difficulty": "easy"}
}
}
```
**Complex Nested Object Creation:**
```json
{
"messages": [
{"role": "user", "content": "Create a user for John Doe"},
{
"role": "assistant",
"tool_calls": [
{
"type": "function",
"function": {
"name": "create_user",
"arguments": "{\"user\": {\"firstName\": \"John\", \"lastName\": \"Doe\", \"age\": 30}}"
}
}
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "create_user",
"description": "Create a new user",
"parameters": {
"type": "object",
"properties": {
"user": {
"type": "object",
"properties": {
"firstName": {"type": "string"},
"lastName": {"type": "string"},
"age": {"type": "number"}
},
"required": ["firstName", "lastName", "age"]
}
},
"required": ["user"]
}
}
}
],
"ground_truth": "{\"tool_calls\": [{\"type\": \"function\", \"function\": {\"name\": \"create_user\", \"arguments\": \"{\\\"user\\\": {\\\"firstName\\\": \\\"John\\\", \\\"lastName\\\": \\\"Doe\\\", \\\"age\\\": 30}}\"}}]}",
"input_metadata": {
"row_id": "create_user_nested",
"dataset_info": {"task_type": "function_calling", "difficulty": "hard"}
}
}
```
### Dataset Characteristics
**Test Scenarios**: The dataset covers various function calling challenges:
* **Perfect matches**: Correct function name and arguments
* **Argument mismatches**: Wrong parameter values (e.g., wrong temperature unit)
* **Function name errors**: Calling non-existent or wrong functions
* **Extra calls**: Making unnecessary tool calls
* **Missing calls**: Failing to call required functions
* **No-call scenarios**: Queries that don't require function execution
* **Complex objects**: Nested parameter structures
* **Invalid JSON**: Malformed argument strings
**Tool Types**: Various function categories:
* **Weather services**: Location-based queries with units
* **User management**: CRUD operations with complex objects
* **Data retrieval**: Search and find operations
* **Utility functions**: Simple parameterized operations
**Difficulty Levels**: Progressive complexity:
* **Easy**: Simple single-parameter calls
* **Medium**: Multi-parameter calls with validation
* **Hard**: Nested object structures and complex schemas
## Step 1: Import Required Dependencies
First, we import the necessary modules from the EP framework:
```python
import json
from typing import Any, Dict, List
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
from eval_protocol.rewards.function_calling import exact_tool_match_reward
```
* `json`: Python's JSON module for parsing ground truth data
* `typing`: Python's typing module for type hints (Any, Dict, List)
* `EvaluationRow`: The data structure containing conversation messages and metadata
* `SingleTurnRolloutProcessor`: Rollout processor for single-turn conversations
* `evaluation_test`: Decorator for configuring evaluation tests
* `exact_tool_match_reward`: Built-in function calling evaluation function
## Step 2: Create the Dataset Adapter
We need to convert the function calling dataset format to the EP's expected format:
```python
def function_calling_to_evaluation_row(rows: List[Dict[str, Any]]) -> List[EvaluationRow]:
"""
Convert function calling dataset entries to EvaluationRow objects.
This adapter extracts the conversation messages, available tools, and ground truth
from the function calling dataset format and creates EvaluationRow objects that
the EP framework can process.
Args:
rows: List of function calling dataset entries
Returns:
List of EvaluationRow objects ready for evaluation
"""
dataset: List[EvaluationRow] = []
for row in rows:
dataset.append(
EvaluationRow(
messages=row["messages"][:1], # Only the user message
tools=row["tools"], # Available function definitions
ground_truth=row["ground_truth"] # Expected tool calls
)
)
return dataset
```
This adapter:
* Takes the raw function calling dataset as a list of dictionaries
* Extracts the user message (first message in the conversation)
* Includes the available tools/function definitions
* Sets the ground truth to the expected tool calls
* Returns the list of evaluation rows
**Key transformations:**
* **Message extraction**: Uses only the user message since the assistant's response will be generated during evaluation
* **Tool preservation**: Maintains the function schemas for context
* **Ground truth**: Preserves the expected tool calls for comparison
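Note that `ground_truth` is a JSON string whose `arguments` values are themselves JSON-encoded strings. Decoding the weather example above shows the two levels:

```python
import json

ground_truth = (
    '{"tool_calls": [{"type": "function", "function": '
    '{"name": "get_weather", "arguments": "{\\"location\\": \\"London\\", \\"unit\\": \\"celsius\\"}"}}]}'
)

parsed = json.loads(ground_truth)
call = parsed["tool_calls"][0]["function"]
print(call["name"])                   # get_weather
print(json.loads(call["arguments"]))  # {'location': 'London', 'unit': 'celsius'}
```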
## Step 3: Configure and Run the Evaluation
We use the `@evaluation_test` decorator to configure the evaluation:
```python
@evaluation_test(
    input_dataset=["tests/pytest/data/function_calling.jsonl"],
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct"}],
    mode="pointwise",
    dataset_adapter=function_calling_to_evaluation_row,
    rollout_processor=SingleTurnRolloutProcessor(),
)
async def test_pytest_function_calling(row: EvaluationRow) -> EvaluationRow:
    """Run pointwise evaluation on sample dataset using pytest interface."""
    ground_truth = json.loads(row.ground_truth)
    result = exact_tool_match_reward(row.messages, ground_truth)
    row.evaluation_result = result
    print(result)
    return row
```
**Configuration parameters:**
* `input_dataset`: Path to the function calling dataset JSONL file
* `completion_params`: Model settings (Fireworks Kimi model in this case)
* `mode`: `pointwise` for evaluating individual rows since each row can be evaluated independently
* `dataset_adapter`: Function that converts the function calling format to EvaluationRow objects
* `rollout_processor`: `SingleTurnRolloutProcessor()` for single-turn function calling evaluations
**Evaluation process:**
1. **Parse ground truth**: Convert the JSON string to a dictionary for comparison
2. **Extract tool calls**: The `exact_tool_match_reward` function analyzes the assistant's response
3. **Compare exactly**: Check if function names, arguments, and order match perfectly
4. **Return results**: Provide binary score (1.0 for perfect match, 0.0 for any mismatch)
## Core Functions Explained
### `exact_tool_match_reward` Function
The `exact_tool_match_reward` function is a built-in evaluation function that performs exact matching between generated and expected tool calls. It's located in `eval_protocol.rewards.function_calling`.
**Key Features:**
* **Exact matching**: Requires perfect alignment of function names, arguments, and order
* **Multiple formats**: Handles both structured tool calls and XML-formatted calls
* **JSON parsing**: Automatically deserializes and normalizes tool call arguments
* **Robust comparison**: Uses sorted JSON serialization for consistent comparison
* **Error handling**: Gracefully handles malformed inputs and edge cases
**Function Signature:**
```python
def exact_tool_match_reward(
    messages: Union[List[Message], List[Dict[str, Any]]],
    ground_truth: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> EvaluateResult:
```
**Parameters:**
* `messages`: List of conversation messages (extracts tool calls from the last assistant message)
* `ground_truth`: Expected tool calls dictionary for comparison
* `**kwargs`: Additional parameters (not used in this implementation)
**Return Value:**
* `EvaluateResult` with score (1.0 for exact match, 0.0 for any mismatch) and detailed reasoning
**Example Usage:**
```python
result = exact_tool_match_reward(
    messages=messages,
    ground_truth={
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"location": "London", "unit": "celsius"}',
                },
            }
        ]
    },
)

print(f"Score: {result.score}")    # 1.0 if exact match, 0.0 otherwise
print(f"Reason: {result.reason}")  # Detailed explanation of the evaluation
```
### `eval_tool_call` Function
The core evaluation logic is implemented in the `eval_tool_call` function, which handles the detailed comparison of tool calls.
**Function Signature:**
```python
def eval_tool_call(generation: dict, ground_truth: dict) -> bool:
```
**Implementation Details:**
1. **Extract expected calls**: Parse ground truth tool calls from the expected format
2. **Process generated calls**: Handle both structured tool calls and XML-formatted calls
3. **Normalize formats**: Convert all calls to a consistent internal format
4. **Compare exactly**: Use JSON serialization with sorted keys for deterministic comparison
**Supported Formats:**
* **Structured tool calls**: Standard OpenAI format with `tool_calls` array
* **XML-formatted calls**: Tool calls wrapped in XML-style tags inside the message content
* **Mixed formats**: Combinations of different call types
### `compare_tool_calls` Function
The final comparison is performed by the `compare_tool_calls` function, which ensures exact matching.
**Function Signature:**
```python
def compare_tool_calls(generated_tool_calls: list, gt_tool_calls: list) -> bool:
```
**Comparison Logic:**
1. **Length check**: Number of tool calls must match exactly
2. **JSON serialization**: Convert each tool call to sorted JSON string
3. **Exact matching**: Compare serialized strings for perfect equality
4. **Order matters**: Tool calls must be in the same sequence
**Example Comparison:**
```python
# Generated calls
generated = [
    {"name": "get_weather", "arguments": '{"location": "London", "unit": "celsius"}'}
]

# Expected calls
expected = [
    {"name": "get_weather", "arguments": '{"location": "London", "unit": "celsius"}'}
]
# Result: True (exact match)
```
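As a rough sketch of the comparison described above (order-sensitive equality over sorted-key JSON; hypothetical code, not the SDK's implementation):

```python
import json
from typing import Any, Dict, List

def compare_tool_calls_sketch(generated: List[Dict[str, Any]], expected: List[Dict[str, Any]]) -> bool:
    """Exact, order-sensitive comparison after normalizing each call's argument JSON."""
    if len(generated) != len(expected):
        return False

    def normalize(call: Dict[str, Any]) -> str:
        args = call.get("arguments", "{}")
        parsed = json.loads(args) if isinstance(args, str) else args
        return json.dumps({"name": call.get("name"), "arguments": parsed}, sort_keys=True)

    return all(normalize(g) == normalize(e) for g, e in zip(generated, expected))

generated = [{"name": "get_weather", "arguments": '{"location": "London", "unit": "celsius"}'}]
expected = [{"name": "get_weather", "arguments": '{"unit": "celsius", "location": "London"}'}]
print(compare_tool_calls_sketch(generated, expected))  # True: argument key order does not matter
```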
## Evaluation Scenarios and Results
The function calling evaluation handles various scenarios with different outcomes:
### Perfect Match (Score: 1.0)
**Scenario**: Model calls the exact function with correct arguments
```json
{
"generated": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"},
"expected": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"}
}
```
**Result**: ✅ Perfect match - all function names, arguments, and order are correct
### Argument Mismatch (Score: 0.0)
**Scenario**: Model calls correct function but with wrong arguments
```json
{
"generated": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"fahrenheit\"}"},
"expected": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"}
}
```
**Result**: ❌ Argument mismatch - wrong temperature unit specified
### Function Name Error (Score: 0.0)
**Scenario**: Model calls wrong function name
```json
{
"generated": {"name": "fetch_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"},
"expected": {"name": "get_weather", "arguments": "{\"location\": \"London\", \"unit\": \"celsius\"}"}
}
```
**Result**: ❌ Function name error - called non-existent function
### Extra Tool Call (Score: 0.0)
**Scenario**: Model makes unnecessary additional calls
```json
{
"generated": [
{"name": "get_weather", "arguments": "{\"location\": \"London\"}"},
{"name": "extra_call", "arguments": "{}"}
],
"expected": [
{"name": "get_weather", "arguments": "{\"location\": \"London\"}"}
]
}
```
**Result**: ❌ Extra tool call - made unnecessary additional function call
### Missing Tool Call (Score: 0.0)
**Scenario**: Model fails to call required function
```json
{
"generated": [],
"expected": [
{"name": "get_weather", "arguments": "{\"location\": \"London\"}"}
]
}
```
**Result**: ❌ Missing tool call - failed to call required function
### No Call Expected (Score: 1.0)
**Scenario**: Query doesn't require function execution
```json
{
"generated": [],
"expected": []
}
```
**Result**: ✅ No call expected - correctly avoided unnecessary function calls
## Advanced Features
### XML-Formatted Tool Calls
The evaluation supports XML-formatted tool calls embedded in content:
```python
# Assistant response with XML formatting
content = '{"type": "function", "function": {"name": "get_weather", "arguments": "{\\"location\\": \\"Berlin\\", \\"unit\\": \\"celsius\\"}"}}'
# The evaluation automatically parses and compares these calls
```
### Complex Nested Objects
The evaluation handles complex parameter structures:
```python
# Nested user object creation
{
"name": "create_user",
"arguments": '{"user": {"firstName": "John", "lastName": "Doe", "age": 30}}'
}
```
### Multiple Tool Calls
The evaluation supports scenarios with multiple sequential tool calls:
```python
# Multiple weather queries
[
{"name": "get_weather", "arguments": '{"location": "London"}'},
{"name": "get_weather", "arguments": '{"location": "Paris"}'}
]
```
## Best Practices for Function Calling Evaluation
### Dataset Design
* **Diverse scenarios**: Include various failure modes and edge cases
* **Progressive difficulty**: Start with simple calls and progress to complex objects
* **Real-world examples**: Use realistic function schemas and use cases
* **Clear ground truth**: Ensure expected tool calls are unambiguous
### Evaluation Configuration
* **Appropriate models**: Use models with strong function calling capabilities
* **Consistent parameters**: Use deterministic settings (temperature=0.0) for reproducible results
* **Adequate context**: Provide clear function descriptions and examples
* **Error handling**: Gracefully handle parsing errors and edge cases
### Result Interpretation
* **Binary scoring**: Understand that this is a strict exact-match evaluation
* **Detailed analysis**: Use the reasoning field to understand specific failures
* **Pattern recognition**: Look for systematic errors in function selection or argument formatting
* **Model comparison**: Compare different models' function calling accuracy
## Conclusion
This function calling evaluation example demonstrates how to create robust assessments of AI models' tool usage capabilities. The `exact_tool_match_reward` function provides a strict but comprehensive evaluation that ensures models can:
1. **Identify when tools are needed**: Distinguish between queries requiring function calls and those that don't
2. **Select appropriate functions**: Choose the correct tool from available options
3. **Provide accurate parameters**: Pass the right arguments with correct values
4. **Follow proper formatting**: Use the expected tool call structure consistently
This evaluation is particularly valuable for:
* **Agent development**: Ensuring AI agents can reliably interact with external systems
* **API integration**: Validating models' ability to use structured APIs correctly
* **Tool selection**: Testing models' understanding of when and how to use different tools
* **Parameter accuracy**: Verifying that models provide correct input values
The function calling evaluation complements other evaluation types by focusing on **execution accuracy** rather than content generation, making it essential for building reliable AI systems that can interact with external tools and APIs.
# GPQA (Open-Resource)
Source: https://evalprotocol.io/example/gpqa
Multiple-choice science QA with simple exact-match scoring
This example runs a minimal GPQA-style evaluation using the public Diamond split CSV. It’s meant for quick comparisons during model picking, not a full benchmark reproduction.
This example is implemented as a suite in `eval_protocol/benchmarks/suites/gpqa.py` and exported as `gpqa`.
## What it does
* Downloads the GPQA Diamond CSV and constructs MCQ prompts (A–D).
* Appends a system-side ground-truth token (e.g., `__GT__:A`) per row.
* Extracts the predicted letter from the assistant’s final message and checks exact match.
## How it’s configured
* `@evaluation_test` feeds prebuilt `input_messages` and sets rollout parameters.
* Simple scoring: 1.0 for exact letter match, else 0.0.
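A minimal sketch of that scoring step (hypothetical helpers; the suite's own extraction logic is more involved), assuming the ground truth arrives as a `__GT__:<letter>` token as described above:

```python
import re
from typing import Optional

def extract_letter(answer_text: str) -> Optional[str]:
    """Return the last standalone A-D letter in the model's final message."""
    letters = re.findall(r"\b([A-D])\b", answer_text)
    return letters[-1] if letters else None

def score_row(model_output: str, gt_token: str) -> float:
    """Compare the predicted letter with the ground truth from a __GT__:<letter> token."""
    ground_truth = gt_token.split(":", 1)[1].strip()
    predicted = extract_letter(model_output)
    return 1.0 if predicted == ground_truth else 0.0

print(score_row("The correct option is (C).", "__GT__:C"))  # 1.0
```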
## Run it locally
After installing eval-protocol, you can run the benchmark from anywhere:
```bash
pytest --pyargs eval_protocol.benchmarks.test_gpqa -v \
--ep-print-summary --ep-summary-json artifacts/gpqa.json
```
Use `--ep-max-rows=20` to tune runtime. The CSV is fetched at runtime.
## Notes
* Convenience-oriented: focuses on a clean pipeline and minimal metrics.
* The evaluation relies on extracting exactly one of `A, B, C, D` from the model output.
# GSM8K Math Evaluation
Source: https://evalprotocol.io/example/gsm8k
Evaluate mathematical reasoning with GSM8K dataset using structured thinking format
This example demonstrates how to create a comprehensive math evaluation using the GSM8K dataset. The evaluation combines numerical accuracy checking with format validation, requiring models to follow a structured thinking format with `<think>...</think>` and `<answer>...</answer>` tags.
You can find the complete code for this example at [test\_pytest\_math\_example.py](https://github.com/eval-protocol/python-sdk/blob/main/tests/pytest/test_pytest_math_example.py).
## Understanding the GSM8K Dataset
The GSM8K (Grade School Math 8K) dataset contains grade school math word problems that test mathematical reasoning and problem-solving abilities. Each problem requires multi-step reasoning to arrive at the correct numerical answer.
### Dataset Format
Each entry in the dataset contains:
* **`id`**: Unique identifier for the test case
* **`user_query`**: The math word problem to solve
* **`ground_truth_for_eval`**: The expected solution with step-by-step reasoning and final answer
### Example Dataset Entries
**Basic Arithmetic Problem:**
```json
{
"id": "gsm8k_test_0",
"user_query": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
"ground_truth_for_eval": "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer's market.\n#### 18"
}
```
**Percentage and Profit Problem:**
```json
{
"id": "gsm8k_test_2",
"user_query": "Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?",
"ground_truth_for_eval": "The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000"
}
```
### Dataset Characteristics
**Problem Types**: The dataset covers various mathematical concepts:
* Basic arithmetic (addition, subtraction, multiplication, division)
* Percentages and ratios
* Multi-step word problems
* Real-world applications (business, cooking, sports)
**Solution Format**: Ground truth solutions include:
* Step-by-step reasoning with intermediate calculations
* Computed values annotated in `<<...>>` format (e.g., `<<16-3-4=9>>9`)
* Final answer marked with `#### answer`
**Complexity**: Problems require:
* Understanding of mathematical concepts
* Multi-step reasoning
* Accurate numerical computation
* Clear presentation of work
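Because the ground truth always ends with a `#### answer` marker, the expected final value can be recovered with a small helper. The sketch below only illustrates that convention; the built-in `math_reward` function performs more robust number extraction.
```python
def extract_final_answer(ground_truth: str) -> str:
    """Return the value after the last '#### ' marker in a GSM8K solution string (sketch)."""
    marker = "#### "
    if marker not in ground_truth:
        return ground_truth.strip()
    # Take everything after the final marker and strip commas and whitespace.
    return ground_truth.rsplit(marker, 1)[-1].strip().replace(",", "")


# Example: extract_final_answer("Janet sells ... #### 18") -> "18"
```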
## Step 1: Import Required Dependencies
First, we import the necessary modules from the EP framework:
```python
import re
from typing import Any, Dict, List
from eval_protocol.models import EvaluateResult, EvaluationRow, MetricResult
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
from eval_protocol.rewards.math import math_reward
from examples.math_example.main import check_think_answer_format
from tests.pytest.helper.gsm8k_to_evaluation_row import gsm8k_to_evaluation_row
```
* `re`: Python's regex module for pattern matching
* `typing`: Python's typing module for type hints (Any, Dict, List)
* `EvaluateResult`: The result object containing evaluation score and reasoning
* `EvaluationRow`: The data structure containing conversation messages and ground truth
* `MetricResult`: Individual metric results for detailed analysis
* `SingleTurnRolloutProcessor`: Default processor for single-turn conversations
* `evaluation_test`: Decorator for configuring evaluation tests
* `math_reward`: Built-in math evaluation function
* `check_think_answer_format`: Function to validate structured thinking format
* `gsm8k_to_evaluation_row`: Adapter function to convert GSM8K dataset format
## Step 2: Create the Dataset Adapter
We need to convert the GSM8K dataset format to the EP's expected format:
```python
# Imports needed when this adapter lives in its own helper module
from typing import Any, Dict, List

from eval_protocol.models import EvaluationRow, Message


def gsm8k_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
"""Convert GSM8K dataset entries to EvaluationRow objects."""
return [
EvaluationRow(
messages=[Message(role="user", content=row["user_query"])],
ground_truth=row["ground_truth_for_eval"]
)
for row in data
]
```
This adapter:
* Takes the raw GSM8K dataset as a list of dictionaries
* Converts each row to an `EvaluationRow` with a user message containing the math problem
* Sets the ground truth to the expected solution with step-by-step reasoning
* Returns the list of evaluation rows
## Step 3: Define Format Validation
We create a function to check if the model's response follows the required structured thinking format:
```python
def check_think_answer_format(text: str) -> bool:
    """Check if text follows <think>...</think><answer>...</answer> format."""
    if not text:
        return False
    pattern = r"<think>[\s\S]*?</think>[\s\S]*?<answer>[\s\S]*?</answer>"
    return bool(re.search(pattern, text))
```
**Regex pattern explained:**
* `<think>[\s\S]*?</think>`: Matches the thinking section, including any characters and newlines
* `[\s\S]*?`: Matches any characters (including newlines) between the think and answer tags
* `<answer>[\s\S]*?</answer>`: Matches the answer section
* `re.search()`: Searches for the pattern anywhere in the text (not requiring it to be the entire text)
This ensures the response contains both `<think>` and `<answer>` sections in the correct order.
## Step 4: Configure, implement, and run the evaluation
We use the `@evaluation_test` decorator to configure the evaluation. The evaluation function combines numerical accuracy with format validation.
```python
@evaluation_test(
input_dataset=["development/gsm8k_sample.jsonl"],
dataset_adapter=gsm8k_to_evaluation_row,
completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0}],
max_dataset_rows=5,
passed_threshold=0.0,
rollout_processor=SingleTurnRolloutProcessor(),
mode="pointwise",
evaluation_test_kwargs=[
{"math_reward_kwargs": {"tolerance": 0.001, "absolute_tolerance": 1e-8, "require_units": False}}
],
)
def test_math_dataset(row: EvaluationRow, **kwargs) -> EvaluationRow:
"""
Evaluate math problem solving considering both accuracy and format.
This function demonstrates how to combine multiple evaluation criteria:
- Numerical accuracy using built-in math evaluation (80% weight)
    - Format compliance checking for <think>...</think><answer>...</answer> structure (20% weight)
Args:
row: EvaluationRow containing the conversation messages and ground truth
**kwargs: Additional parameters (like math_reward_kwargs)
Returns:
EvaluationRow with the evaluation result
"""
# Get the assistant's response
assistant_message = row.messages[-1]
if isinstance(assistant_message, dict):
assistant_response = assistant_message.get("content", "")
else:
assistant_response = assistant_message.content or ""
# Evaluate numerical accuracy using built-in function
accuracy_result = math_reward(messages=row.messages, ground_truth=row.ground_truth, **kwargs["math_reward_kwargs"])
    # Evaluate format compliance (looking for <think>...</think><answer>...</answer> format)
format_correct = check_think_answer_format(assistant_response)
format_score = 1.0 if format_correct else 0.0
# Calculate combined score with 80% accuracy and 20% formatting weight
combined_score = (0.8 * accuracy_result.score) + (0.2 * format_score)
# Create metrics structure expected by tests
metrics = {
"accuracy_reward": MetricResult(
score=accuracy_result.score,
reason=f"Numerical accuracy: {accuracy_result.reason}",
is_score_valid=True,
),
"format_reward": MetricResult(
score=format_score,
reason=f"Format compliance: {'correct' if format_correct else 'incorrect'} ...... structure",
is_score_valid=True,
),
}
row.evaluation_result = EvaluateResult(
score=combined_score,
reason=f"Combined score: {combined_score:.2f} (accuracy: {accuracy_result.score:.2f}, format: {format_score:.2f})",
metrics=metrics,
)
return row
```
**Key evaluation aspects:**
* **Numerical Accuracy**: Uses the built-in `math_reward` function to check if the final answer matches the ground truth (80% weight)
* **Format Compliance**: Ensures responses follow the structured thinking format (20% weight)
* **Weighted Combination**: Combines accuracy and format scores using 80% accuracy + 20% formatting weights
* **Detailed Metrics**: Provides separate scores for accuracy and format for detailed analysis
**Configuration parameters:**
* `input_dataset`: Path to the GSM8K sample dataset
* `dataset_adapter`: Function that converts GSM8K format to EvaluationRow objects
* `completion_params`: Model and sampling parameters (the Fireworks Kimi model with temperature 0.0 for deterministic results)
* `max_dataset_rows`: Limit to 5 test cases for quick evaluation
* `passed_threshold`: Set to 0.0 to see all results (can be adjusted based on requirements)
* `rollout_processor`: Uses default single-turn processor for math problems
* `mode`: `pointwise` for evaluating individual rows since each row can be evaluated independently
* `evaluation_test_kwargs`: Additional parameters for the evaluation function
## Core Functions Explained
### `math_reward` Function
The `math_reward` function is a built-in evaluation function that extracts numerical answers from text and compares them with expected values. It's located in `eval_protocol.rewards.math`.
**Key Features:**
* **Extracts numbers** from both model responses and ground truth using sophisticated regex patterns
* **Supports multiple formats**: integers, decimals, fractions, scientific notation, LaTeX formatting
* **Configurable tolerance**: Handles floating-point precision issues with `tolerance` and `absolute_tolerance` parameters
* **Unit handling**: Can require or ignore units with the `require_units` parameter
* **Robust matching**: Finds the best match between extracted answers when multiple numbers are present
**Function Signature:**
```python
def math_reward(
messages: List[Message],
*,
ground_truth: str,
tolerance: float = 0.001,
absolute_tolerance: float = 1e-8,
require_units: bool = False,
**kwargs: Any,
) -> EvaluateResult:
```
**Parameters:**
* `messages`: List of conversation messages (extracts from the last assistant message)
* `ground_truth`: Expected answer string containing the correct numerical value
* `tolerance`: Relative tolerance for floating-point comparisons (default: 0.001)
* `absolute_tolerance`: Absolute tolerance for very small numbers (default: 1e-8)
* `require_units`: Whether to require units to match (default: False)
**Return Value:**
* `EvaluateResult` with score (1.0 for correct, 0.0 for incorrect) and detailed reasoning
**Example Usage:**
```python
result = math_reward(
messages=messages,
ground_truth="18",
tolerance=0.001,
absolute_tolerance=1e-8,
require_units=False
)
print(f"Score: {result.score}") # 1.0 if answer matches, 0.0 otherwise
print(f"Reason: {result.reason}") # Detailed explanation of the evaluation
```
### `check_think_answer_format` Function
This function validates that the model's response follows the required structured thinking format with `<think>` and `<answer>` tags.
**Function Signature:**
```python
def check_think_answer_format(text: str) -> bool:
```
**Implementation Details:**
* Uses regex pattern `r"<think>[\s\S]*?</think>[\s\S]*?<answer>[\s\S]*?</answer>"`
* `<think>[\s\S]*?</think>`: Matches the thinking section with any content
* `[\s\S]*?`: Matches any characters (including newlines) between sections
* `<answer>[\s\S]*?</answer>`: Matches the answer section with any content
* Returns `True` if both sections are present in the correct order, `False` otherwise
**Example Valid Format:**
```
<think>
Let me solve this step by step:
1. Janet's ducks lay 16 eggs per day
2. She eats 3 for breakfast
3. She uses 4 for muffins
4. So she sells: 16 - 3 - 4 = 9 eggs
5. At $2 per egg, she makes: 9 * 2 = $18
</think>
<answer>
Janet makes $18 every day at the farmers' market.
</answer>
```
**Example Invalid Formats:**
* Missing `<think>` section: `<answer>18</answer>`
* Missing `<answer>` section: `<think>Step by step reasoning...</think>`
* Wrong order: `<answer>18</answer><think>reasoning...</think>`
* No tags: "The answer is 18"
### `gsm8k_to_evaluation_row` Function
This adapter function converts the GSM8K dataset format to the EP framework's expected `EvaluationRow` format.
**Function Signature:**
```python
def gsm8k_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
```
**Input Format:**
```python
[
{
"id": "gsm8k_test_0",
"user_query": "Janet's ducks lay 16 eggs per day...",
"ground_truth_for_eval": "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs..."
},
# ... more entries
]
```
**Output Format:**
```python
[
EvaluationRow(
messages=[Message(role="user", content="Janet's ducks lay 16 eggs per day...")],
ground_truth="Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs..."
),
# ... more EvaluationRow objects
]
```
**Key Transformations:**
* Extracts `user_query` and creates a `Message` with role "user"
* Uses `ground_truth_for_eval` as the ground truth for comparison
* Creates `EvaluationRow` objects that the EP framework can process
* Maintains the original problem structure while adapting to EP's expected format
## Expected Model Response Format
For optimal evaluation, models should respond in this structured format:
```
<think>
Let me solve this step by step:
1. Janet's ducks lay 16 eggs per day
2. She eats 3 for breakfast
3. She uses 4 for muffins
4. So she sells: 16 - 3 - 4 = 9 eggs
5. At $2 per egg, she makes: 9 * 2 = $18
</think>
<answer>
Janet makes $18 every day at the farmers' market.
</answer>
```
**Format requirements:**
* `<think>` section: Detailed step-by-step reasoning
* `<answer>` section: Clear final answer
* Both sections must be present for format compliance
* Numerical accuracy is evaluated from the final answer
## Evaluation Results
The evaluation provides comprehensive feedback:
**Successful Response:**
* **Score**: 1.0 (0.8 x 1.0 + 0.2 x 1.0 = 1.0)
* **Reason**: "Combined score: 1.00 (accuracy: 1.00, format: 1.00)"
* **Metrics**: Both accuracy and format scores are 1.0
**Correct Answer, Incorrect Format:**
* **Score**: 0.8 (0.8 x 1.0 + 0.2 x 0.0 = 0.8)
* **Reason**: "Combined score: 0.80 (accuracy: 1.00, format: 0.00)"
* **Metrics**: Accuracy score 1.0, format score 0.0
**Incorrect Answer, Correct Format:**
* **Score**: 0.2 (0.8 x 0.0 + 0.2 x 1.0 = 0.2)
* **Reason**: "Combined score: 0.20 (accuracy: 0.00, format: 1.00)"
* **Metrics**: Accuracy score 0.0, format score 1.0
This comprehensive evaluation ensures that models can:
1. Understand complex mathematical word problems
2. Perform accurate numerical calculations
3. Present solutions in a structured, readable format
4. Provide step-by-step reasoning for transparency
The GSM8K evaluation demonstrates how to create robust, multi-criteria assessments that can be used for model comparison, fine-tuning validation, and deployment readiness testing.
# Hallucination Detection Evaluation
Source: https://evalprotocol.io/example/hallucination-detection
Detect factual inaccuracies using LLM-as-judge to compare responses against ground truth knowledge
This example demonstrates how to create comprehensive hallucination detection evaluations using the Eval Protocol (EP) framework. The evaluation uses an LLM-as-judge approach to assess whether AI model responses contain factual inaccuracies by comparing them against provided ground truth knowledge.
You can find the complete code for this example at [test\_hallucination.py](https://github.com/eval-protocol/python-sdk/blob/main/tests/pytest/test_hallucination.py).
## Understanding Hallucination Detection Evaluation
Hallucination detection evaluation assesses whether AI models provide **factually accurate responses** that align with verified knowledge, rather than generating plausible-sounding but incorrect information. Unlike traditional accuracy metrics that focus on exact matches, this evaluation tests **factual consistency and truthfulness** - critical for building trustworthy AI systems.
## The HaluEval Dataset
This evaluation uses the **HaluEval QA dataset**, a comprehensive benchmark containing 10,000 question-answering samples specifically designed to test hallucination detection. The dataset is built on HotpotQA with Wikipedia knowledge and includes both correct answers and ChatGPT-generated plausible hallucinations.
### Dataset Structure
Each entry contains:
* **`knowledge`**: Wikipedia context providing factual background information
* **`question`**: Multi-hop reasoning question from HotpotQA requiring knowledge synthesis
* **`right_answer`**: Verified ground-truth answer from HotpotQA
* **`hallucinated_answer`**: ChatGPT-generated plausible but factually incorrect response
### Example Entry
```json
{
"knowledge": "Her self-titled debut studio album was released on 2 June 2017.\"New Rules\" is a song by English singer Dua Lipa from her eponymous debut studio album (2017).",
"question": "Dua Lipa, an English singer, songwriter and model, the album spawned the number-one single \"New Rules\" is a song by English singer Dua Lipa from her eponymous debut studio album, released in what year?",
"right_answer": "2017",
"hallucinated_answer": "The album was released in 2018."
}
```
**Sample Dataset**: The EP python-sdk includes a sample of **3 representative rows** from the HaluEval QA dataset for testing and demonstration purposes. The full HaluEval QA dataset contains 10,000 knowledge-question pairs with both correct and hallucinated answers, designed to test models' ability to distinguish factual accuracy from plausible misinformation.
## Step 1: Import Required Dependencies
First, we import the necessary modules from the EP framework and set up the LLM judge:
```python
import json
from typing import Any, Dict, List
from fireworks import LLM
from eval_protocol.models import EvaluateResult, EvaluationRow, Message, MetricResult
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
# Initialize the LLM judge for evaluation
judge_llm = LLM(model="accounts/fireworks/models/kimi-k2-instruct", deployment_type="serverless")
```
* `json`: For parsing LLM judge responses and handling structured data
* `typing`: Python's typing module for type hints
* `fireworks.LLM`: The LLM client for creating the judge model
* `EvaluateResult`, `EvaluationRow`, `Message`, `MetricResult`: Core EP data structures
* `SingleTurnRolloutProcessor`: Default processor for single-turn conversations
* `evaluation_test`: Decorator for configuring evaluation tests
* `judge_llm`: Pre-configured LLM instance that serves as the factual accuracy judge
## Step 2: Create the Dataset Adapter
We need to convert the hallucination dataset format to the EP's expected format:
```python
def hallucination_dataset_adapter(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
"""
Convert HaluEval dataset to EvaluationRow objects.
This adapter combines the knowledge context with the question to create
a complete user message, and stores the correct answer as ground truth
for the LLM judge to use during evaluation.
Args:
data: List of hallucination dataset entries with knowledge, question, and right_answer
Returns:
List of EvaluationRow objects ready for evaluation
"""
return [
EvaluationRow(
messages=[Message(role="user", content=f"Knowledge: {item['knowledge']}\n\nQuestion: {item['question']}")],
ground_truth=item["right_answer"]
)
for item in data
]
```
## Step 3: Configure and Run the Evaluation
We use the `@evaluation_test` decorator to configure the hallucination detection evaluation:
```python
@evaluation_test(
input_dataset=["tests/pytest/data/halueval_sample_dataset.jsonl"],
dataset_adapter=hallucination_dataset_adapter,
completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0, "max_tokens": 512}],
rollout_processor=SingleTurnRolloutProcessor(),
passed_threshold=1.0,
num_runs=1,
mode="pointwise",
)
def test_hallucination_detection(row: EvaluationRow) -> EvaluationRow:
"""
Test for response correctness using LLM-as-judge.
This function:
1. Extracts the assistant's response to the knowledge-question pair
2. Uses an LLM judge to compare the response against the correct answer
3. Returns a binary score based on factual accuracy
Args:
row: EvaluationRow containing the conversation messages and correct answer in ground_truth
Returns:
EvaluationRow with the evaluation result
"""
messages = row.messages
assistant_response = messages[-1].content
    if not assistant_response:
        row.evaluation_result = EvaluateResult(score=0.0, reason="❌ No assistant response found")
        return row
correct_answer = row.ground_truth
# LLM judge system prompt for factual accuracy assessment
system_prompt = """
TASK
- You will be given an assistant's response and the correct answer.
- Your job is to evaluate whether the assistant's response is factually consistent with the correct answer.
- Grade whether the assistant got it right or wrong.
FORMAT
- Your response should be a JSON object with the following fields:
- `reasoning`: a short explanation for your classification
- `is_correct`: `true` if the assistant's response matches the correct answer, `false` otherwise
Example response structure:
    {
        "reasoning": "<short explanation>",
        "is_correct": <true or false>
    }
"""
user_prompt = f"""
assistant_response:
{assistant_response}
correct_answer:
{correct_answer}
"""
try:
# Query the LLM judge for factual accuracy assessment
response = judge_llm.chat.completions.create(
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.1,
max_tokens=500,
)
result_data = json.loads(response.choices[0].message.content)
is_correct = result_data.get("is_correct", False)
reasoning = result_data.get("reasoning", "Could not parse reasoning")
except Exception as e:
# Fallback if LLM judge fails
is_correct = False
reasoning = f"Evaluation failed: {str(e)}"
score = 1.0 if is_correct else 0.0
if is_correct:
assessment = "✅ Response is correct"
else:
assessment = "❌ Response is incorrect"
reason = f"{assessment}\nReasoning: {reasoning}"
row.evaluation_result = EvaluateResult(
score=score,
reason=reason,
metrics={
"llm_judge": MetricResult(
score=score,
reason=reasoning,
is_score_valid=True
)
}
)
return row
```
**Configuration parameters:**
* `input_dataset`: Path to the HaluEval sample dataset JSONL file
* `completion_params`: The model to evaluate and its sampling settings (moderate token limit for concise responses)
* `passed_threshold`: 100% accuracy threshold (hallucinations should be completely avoided)
* `mode`: `pointwise` for evaluating individual knowledge-question pairs
* `dataset_adapter`: Function that converts HaluEval format to EvaluationRow objects
* `rollout_processor`: Uses the default single-turn processor
**Evaluation process:**
1. **Response extraction**: Get the assistant's answer to the knowledge-question pair
2. **Judge preparation**: Set up LLM judge with clear evaluation criteria
3. **Factual comparison**: Use judge to compare assistant response against correct answer
4. **Structured evaluation**: Judge provides reasoning and binary correctness assessment
5. **Score assignment**: Convert judge decision to numerical score (1.0 or 0.0)
## Core Functions Explained
### LLM-as-Judge System
The hallucination detection uses a sophisticated LLM judge to assess factual accuracy:
**Judge System Prompt Design:**
* **Clear task definition**: Explicitly states the factual consistency evaluation goal
* **Structured output**: Requires JSON format with reasoning and binary decision
* **Objective criteria**: Focuses on factual accuracy rather than style or completeness
* **Consistent format**: Standardizes judge responses for reliable parsing
**Judge Evaluation Process:**
```python
# The judge receives both responses for direct comparison
system_prompt = """
TASK
- You will be given an assistant's response and the correct answer.
- Your job is to evaluate whether the assistant's response is factually consistent with the correct answer.
- Grade whether the assistant got it right or wrong.
FORMAT
- Your response should be a JSON object with the following fields:
- `reasoning`: a short explanation for your classification
- `is_correct`: `true` if the assistant's response matches the correct answer, `false` otherwise
"""
```
**Advantages of LLM-as-Judge:**
* **Semantic understanding**: Can recognize factually equivalent statements with different wording
* **Context awareness**: Understands nuanced relationships between concepts
* **Flexible matching**: Handles partial answers and different levels of detail appropriately
* **Reasoning transparency**: Provides explanations for evaluation decisions
## Evaluation Scenarios and Results
The hallucination detection evaluation handles various factual accuracy scenarios:
### Factually Correct Response (Score: 1.0)
**Scenario**: Model provides accurate information consistent with the knowledge
```python
# Knowledge: "The speed of light in vacuum is approximately 299,792,458 meters per second..."
# Question: "What is the speed of light in vacuum?"
# Model response: "The speed of light in vacuum is approximately 299,792,458 m/s."
# Correct answer: "The speed of light in vacuum is approximately 299,792,458 meters per second."
# Judge reasoning: "The assistant's response is factually accurate. While it uses 'm/s' instead of 'meters per second', both represent the same unit and the numerical value is correct."
# Result: ✅ Response is correct
```
### Factual Inaccuracy (Score: 0.0)
**Scenario**: Model provides incorrect information
```python
# Knowledge: "The Berlin Wall was constructed in 1961..."
# Question: "When was the Berlin Wall built?"
# Model response: "The Berlin Wall was built in 1959."
# Correct answer: "The Berlin Wall was built in 1961."
# Judge reasoning: "The assistant provided an incorrect date. The Berlin Wall was built in 1961, not 1959."
# Result: ❌ Response is incorrect
```
## Conclusion
This hallucination detection evaluation demonstrates how to assess AI models' factual accuracy using LLM-as-judge methodology. The evaluation ensures models can provide truthful, accurate responses based on provided knowledge without introducing false information.
This evaluation is particularly valuable for:
* **Factual accuracy assessment**: Testing models' ability to stay grounded in provided knowledge
* **Trustworthiness validation**: Ensuring AI systems provide reliable, accurate information
* **Knowledge-based applications**: Validating models for use in educational or informational contexts
The hallucination detection evaluation focuses on **factual consistency and truthfulness** rather than stylistic preferences, making it essential for building reliable AI systems that users can trust for accurate information. It provides objective assessment through LLM judges with detailed reasoning and handles diverse knowledge domains comprehensively.
# HealthBench (Open-Resource)
Source: https://evalprotocol.io/example/healthbench
Tiny, rubric-keyword proxy for clinical safety/quality signals
This example provides a minimal, rubric-driven proxy inspired by HealthBench, intended for quick sanity checks on clinical-style prompts. It is not a comprehensive or official reimplementation.
This example is now implemented as a suite in `eval_protocol/benchmarks/suites/healthbench.py` and exported as `healthbench`.
## What it does
* Uses a few in-memory prompts with small rubric lists.
* Extracts simple keyword requirements from rubric criteria (e.g., “hospital”, “urgent”, “hydration”, “rest”).
* Scores 1.0 if the assistant’s response contains any required rubric keywords; otherwise 0.0.
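A minimal sketch of that any-keyword scoring rule is shown below. The keyword list here is illustrative; the shipped suite derives its keywords from the rubric strings themselves.
```python
from typing import List


def keyword_score(response: str, rubric_keywords: List[str]) -> float:
    """Score 1.0 if the response mentions any required rubric keyword, else 0.0 (sketch)."""
    text = response.lower()
    return 1.0 if any(keyword.lower() in text for keyword in rubric_keywords) else 0.0


# Illustrative usage:
# keyword_score("Please go to the hospital urgently.", ["hospital", "urgent"])  # -> 1.0
```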
## How it’s configured
* `@evaluation_test` sets a small temperature and token budget.
* Messages are constructed inline; rubrics are mapped by prompt string.
## Run it locally
After installing eval-protocol, you can run the benchmark from anywhere:
```bash
pytest --pyargs eval_protocol.benchmarks.test_healthbench -v \
--ep-print-summary --ep-summary-json artifacts/healthbench.json
```
## Notes
* This is a minimal proxy to surface safety/quality cues—not a validated clinical benchmark.
* You can expand the rubric list or keyword extraction as needed for your domain.
# JSON Schema Evaluation
Source: https://evalprotocol.io/example/json-schema
Evaluate JSON content against expected schemas using structural similarity and LLM judgment
This example demonstrates how to create comprehensive JSON schema evaluations
using the Eval Protocol (EP) framework. The evaluation uses the
[`json_schema_reward`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/rewards/json_schema.py)
function to assess whether models generate JSON content that matches expected
schemas, with options for both structural validation and LLM-based judgment.
You can find the complete code for this example at [test\_pytest\_json\_schema.py](https://github.com/eval-protocol/python-sdk/blob/main/tests/pytest/test_pytest_json_schema.py).
## Understanding JSON Schema Evaluation
JSON schema evaluation assesses a model's ability to:
* **Generate valid JSON**: Produce syntactically correct JSON content
* **Match expected structure**: Create JSON objects that conform to specified schemas
* **Handle complex nested structures**: Work with objects, arrays, and mixed data types
* **Extract JSON from responses**: Parse JSON content from markdown code blocks or plain text
* **Validate type consistency**: Ensure data types match schema specifications
Unlike text-based evaluations that focus on content generation, JSON schema evaluations test a model's **structured data generation capabilities** - essential skills for AI systems that need to produce machine-readable, structured outputs.
## Understanding the Dataset Structure
The JSON schema dataset contains diverse test cases that evaluate different aspects of JSON generation, from simple object creation to complex nested structures with various data types.
### Dataset Format
Each entry in the dataset contains:
* **`messages`**: Conversation history with user requests and assistant responses
* **`ground_truth`**: Optional expected response (not used in schema validation)
* **`evaluation_result`**: Pre-computed evaluation scores for validation
* **`input_metadata`**: Additional context including expected schema and test case descriptions
### Example Dataset Entries
**Perfect Schema Match:**
````json
{
"messages": [
{
"role": "user",
"content": "Give me user information in the following JSON format:\n```json\n{\n \"user\": {\n \"name\": \"string\",\n \"age\": \"number\",\n \"isActive\": \"boolean\"\n },\n \"products\": [\n {\n \"id\": \"string\",\n \"price\": \"number\"\n }\n ]\n}\n```"
},
{
"role": "assistant",
"content": "Here's the user information:\n```json\n{\n \"user\": {\n \"name\": \"John Doe\",\n \"age\": 30,\n \"isActive\": true\n },\n \"products\": [\n {\n \"id\": \"prod-1\",\n \"price\": 29.99\n }\n ]\n}\n```"
}
],
"ground_truth": null,
"input_metadata": {
"dataset_info": {
"test_case": "exact_match",
"description": "JSON content that exactly matches the expected schema",
"expected_schema": {
"type": "object",
"properties": {
"user": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "number"},
"isActive": {"type": "boolean"}
}
},
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"id": {"type": "string"},
"price": {"type": "number"}
}
}
}
}
}
}
}
}
````
**Partial Schema Match with Missing Properties:**
````json
{
"messages": [
{
"role": "user",
"content": "Give me user information in the following JSON format:\n```json\n{\n \"user\": {\n \"name\": \"string\",\n \"age\": \"number\",\n \"isActive\": \"boolean\",\n \"address\": {\n \"street\": \"string\",\n \"city\": \"string\"\n }\n }\n}\n```"
},
{
"role": "assistant",
"content": "Here's the user information:\n```json\n{\n \"user\": {\n \"name\": \"John Doe\",\n \"age\": 30,\n \"email\": \"john@example.com\"\n }\n}\n```"
}
],
"ground_truth": null,
"input_metadata": {
"dataset_info": {
"test_case": "partial_match",
"description": "JSON content with missing and extra properties compared to expected schema",
"expected_schema": {
"type": "object",
"properties": {
"user": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "number"},
"isActive": {"type": "boolean"},
"address": {
"type": "object",
"properties": {
"street": {"type": "string"},
"city": {"type": "string"}
}
}
}
}
}
}
}
}
}
````
**Type Mismatch Errors:**
````json
{
"messages": [
{
"role": "user",
"content": "Give me user information in the following JSON format:\n```json\n{\n \"name\": \"string\",\n \"age\": \"number\",\n \"tags\": [\"string\"]\n}\n```"
},
{
"role": "assistant",
"content": "Here's the user information:\n```json\n{\n \"name\": \"John Doe\",\n \"age\": \"30\",\n \"tags\": {\n \"tag1\": \"value1\"\n }\n}\n```"
}
],
"ground_truth": null,
"input_metadata": {
"dataset_info": {
"test_case": "mismatched_types",
"description": "JSON content with type mismatches (string instead of number, object instead of array)",
"expected_schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "number"},
"tags": {
"type": "array",
"items": {"type": "string"}
}
}
}
}
}
}
````
## Step 1: Import Required Dependencies
First, we import the necessary modules from the EP framework:
```python
import json
from typing import Any, Dict, List
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
from eval_protocol.rewards.json_schema import json_schema_reward
```
* `json`: Python's JSON module for JSON parsing and validation
* `typing`: Python's typing module for type hints (Any, Dict, List)
* `EvaluationRow`: Data structure containing conversation messages and ground truth
* `SingleTurnRolloutProcessor`: Default processor for single-turn conversations
* `evaluation_test`: Decorator for configuring evaluation tests
* `json_schema_reward`: Function to evaluate JSON content against expected schemas
## Step 2: Create the Dataset Adapter
We need to convert the JSON schema dataset format to the EP's expected format:
```python
def json_schema_to_evaluation_row(rows: List[Dict[str, Any]]) -> List[EvaluationRow]:
"""
Convert a json schema row to an evaluation row.
This adapter extracts the conversation messages and metadata from the dataset,
creating EvaluationRow objects that can be processed by the evaluation framework.
Args:
rows: List of JSON schema dataset entries with messages and metadata
Returns:
List of EvaluationRow objects ready for evaluation
"""
dataset: List[EvaluationRow] = []
for row in rows:
dataset.append(
EvaluationRow(
messages=row["messages"][:1], # Use only the first message (user prompt)
ground_truth=row["ground_truth"],
input_metadata=row["input_metadata"],
)
)
return dataset
```
The adapter function:
* **Extracts conversation messages**: Takes the user prompt from the dataset
* **Preserves metadata**: Maintains the expected schema and test case information
* **Handles ground truth**: Passes through any ground truth data (though not used in schema validation)
* **Creates evaluation rows**: Converts dataset entries to the EP's standard format
## Step 3: Configure the Evaluation Test
We use the `@evaluation_test` decorator to configure our JSON schema evaluation:
```python
@evaluation_test(
input_dataset=["tests/pytest/data/json_schema.jsonl"],
completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct"}],
mode="pointwise",
rollout_processor=SingleTurnRolloutProcessor(),
dataset_adapter=json_schema_to_evaluation_row,
)
async def test_pytest_function_calling(row: EvaluationRow) -> EvaluationRow:
"""Run pointwise evaluation on sample dataset using pytest interface."""
expected_schema = row.input_metadata.dataset_info["expected_schema"]
result = json_schema_reward(row.messages, expected_schema=expected_schema)
row.evaluation_result = result
print(row.evaluation_result)
return row
```
The evaluation configuration:
* **`input_dataset`**: Path to the JSON schema dataset file
* **`completion_params`**: Model settings for the target being evaluated (Fireworks Kimi model in this example)
* **`mode`**: Set to "pointwise" for individual sample evaluation
* **`rollout_processor`**: Uses default single-turn processor for conversation handling
* **`dataset_adapter`**: References our custom adapter function
## Step 4: Implement the Evaluation Logic
The core evaluation logic extracts the expected schema and applies the JSON schema reward function:
```python
async def test_pytest_function_calling(row: EvaluationRow) -> EvaluationRow:
"""Run pointwise evaluation on sample dataset using pytest interface."""
# Extract the expected schema from the dataset metadata
expected_schema = row.input_metadata.dataset_info["expected_schema"]
# Apply the JSON schema reward function
result = json_schema_reward(row.messages, expected_schema=expected_schema)
# Store the evaluation result
row.evaluation_result = result
print(row.evaluation_result)
return row
```
The evaluation process:
1. **Extracts expected schema**: Gets the target JSON structure from metadata
2. **Applies schema validation**: Uses `json_schema_reward` to compare generated JSON against expected schema
3. **Stores results**: Saves the evaluation score and metrics in the row
4. **Returns processed row**: Provides the evaluated row for further analysis
## Understanding the JSON Schema Reward Function
The [`json_schema_reward`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/rewards/json_schema.py) function provides comprehensive JSON validation capabilities:
### Core Features
**Schema Extraction and Normalization:**
* Extracts JSON content from assistant responses (supports markdown code blocks)
* Normalizes schemas for consistent comparison
* Handles both object and string schema representations
**Structural Similarity Calculation:**
* Uses Jaccard similarity to compare schema structures
* Evaluates property matches, type consistency, and nested object alignment
* Provides detailed scoring with property-level analysis
**Error Handling:**
* Validates JSON syntax before schema comparison
* Handles malformed JSON with appropriate error scoring
* Provides clear error messages for debugging
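To make the structural-similarity idea concrete, here is a hedged sketch of the steps above: extract JSON from a fenced code block, flatten both the generated instance and the expected schema into `path:type` entries, and compare the two sets with Jaccard similarity. This is a simplification for illustration, not the library's exact implementation.
```python
import json
import re
from typing import Any, Dict, Optional, Set


def extract_json_block(text: str) -> Optional[Any]:
    """Pull JSON out of a ```json fenced block, falling back to the raw text (sketch)."""
    match = re.search(r"```json\s*([\s\S]*?)```", text)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


def instance_paths(value: Any, prefix: str = "") -> Set[str]:
    """Flatten a JSON instance into 'path:type' entries (e.g. 'user.age:number')."""
    if isinstance(value, dict):
        paths: Set[str] = set()
        for key, child in value.items():
            paths |= instance_paths(child, f"{prefix}{key}.")
        return paths
    if isinstance(value, list):
        paths = set()
        for item in value:
            paths |= instance_paths(item, f"{prefix}[].")
        return paths
    if isinstance(value, bool):
        kind = "boolean"
    elif isinstance(value, (int, float)):
        kind = "number"
    elif isinstance(value, str):
        kind = "string"
    else:
        kind = "null"
    return {f"{prefix.rstrip('.')}:{kind}"}


def schema_paths(schema: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Flatten an expected JSON Schema into the same 'path:type' entries."""
    if schema.get("type") == "object":
        paths: Set[str] = set()
        for key, child in schema.get("properties", {}).items():
            paths |= schema_paths(child, f"{prefix}{key}.")
        return paths
    if schema.get("type") == "array":
        return schema_paths(schema.get("items", {}), f"{prefix}[].")
    return {f"{prefix.rstrip('.')}:{schema.get('type', 'unknown')}"}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 1.0 when both sets are empty."""
    return 1.0 if not a and not b else len(a & b) / len(a | b)
```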
## Test Cases and Evaluation Scenarios
The JSON schema evaluation covers various scenarios:
### ✅ Perfect Matches
| Scenario | Description |
| :-------------------------- | :----------------------------------------------- |
| **Exact schema compliance** | JSON that perfectly matches expected structure |
| **Type consistency** | All data types match schema specifications |
| **Nested object handling** | Complex nested structures with proper validation |
### ⚠️ Partial Matches
| Scenario | Description |
| :--------------------- | :---------------------------------------- |
| **Missing properties** | JSON with some expected fields omitted |
| **Extra properties** | JSON with additional fields not in schema |
| **Type mismatches** | Correct structure but wrong data types |
### ❌ Error Cases
| Scenario | Description |
| :----------------------- | :-------------------------------------- |
| **Invalid JSON syntax** | Malformed JSON that cannot be parsed |
| **Missing JSON content** | Responses without extractable JSON |
| **Empty structures** | Edge cases with empty objects or arrays |
### 🔄 Complex Scenarios
| Scenario | Description |
| :------------------- | :--------------------------------------------------- |
| **Array validation** | JSON arrays with consistent item structures |
| **Mixed data types** | Objects with various primitive and complex types |
| **Nested arrays** | Multi-level nested structures with arrays of objects |
## Expected Output
The evaluation produces detailed results including:
**Perfect Match Example:**
```
EvaluateResult(
score=1.0,
reason="Perfect schema match",
metrics={
"schema_similarity": MetricResult(
score=1.0,
reason="Schema similarity: 1.00",
is_score_valid=True
)
}
)
```
**Partial Match Example:**
```
EvaluateResult(
score=0.5,
reason="Partial schema match with missing and extra properties",
metrics={
"schema_similarity": MetricResult(
score=0.5,
reason="Schema similarity: 0.50",
is_score_valid=False
)
}
)
```
**Error Case Example:**
````
EvaluateResult(
score=0.0,
reason="Invalid JSON content",
metrics={
"error": MetricResult(
score=0.0,
reason="Invalid JSON content: Here's the user information:\n```json\n{\n \"name\": \"John Doe\",\n \"age\": \n}\n```",
is_score_valid=False
)
}
)
````
## Conclusion
This JSON schema evaluation demonstrates how to assess AI models' structured data generation capabilities using schema validation and similarity scoring. The evaluation ensures models can generate valid JSON content that conforms to expected schemas, handle complex nested structures, and maintain type consistency.
This evaluation is particularly valuable for:
* **API integration testing**: Validating JSON responses from AI models that interact with external APIs
* **Data pipeline validation**: Ensuring structured data generation meets schema requirements
* **Model capability assessment**: Evaluating language models' ability to produce machine-readable outputs
The JSON schema evaluation focuses on **structural correctness** and **type compliance** rather than semantic content, making it essential for building reliable AI systems that can generate consistent, well-formed JSON data. It provides objective scoring with detailed property-level analysis, comprehensive error handling, and scalable automated validation.
This comprehensive JSON schema evaluation framework provides robust assessment
of model capabilities in structured data generation, essential for applications
requiring reliable JSON output from AI systems.
# LiveBench — Data Analysis
Source: https://evalprotocol.io/example/livebench-data-analysis
CTA, Table Join, and Table Reformat tasks with lightweight scoring ports
This example showcases three LiveBench Data Analysis tasks wired into Eval Protocol with minimal scoring ports adapted from the original benchmark: CTA, Table Join, and Table Reformat.
Suites live in the Python SDK under `eval_protocol/benchmarks/suites/livebench_data_analysis.py` and are exported as runnable benchmarks.
## What it includes
* CTA: case-insensitive exact/suffix match over cleaned strings
* Table Join: F1 over key-value mappings recovered from model output
* Table Reformat: strict table equivalence with parser fallbacks; version auto-selects by release date
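As an illustration of the Table Join scoring idea, the sketch below computes F1 over recovered key-value pairs. It omits the output-parsing step and is a simplification of the actual port.
```python
from typing import Dict


def table_join_f1(predicted: Dict[str, str], ground_truth: Dict[str, str]) -> float:
    """F1 over key-value pairs; a pair counts only if both key and value match (sketch)."""
    if not predicted and not ground_truth:
        return 1.0
    correct = sum(1 for key, value in predicted.items() if ground_truth.get(key) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```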
## Run from CLI (exported benchmark)
After installing eval-protocol, you can run the composite benchmark from anywhere:
```bash
pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis -v \
--ep-print-summary \
--ep-summary-json artifacts/live_bench_data_analysis.json
```
This composite benchmark aggregates the three tasks with a final combined summary.
## Run each task individually
```bash
pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis_cta -v \
--ep-print-summary --ep-summary-json artifacts/cta.json
pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis_tablejoin -v \
--ep-print-summary --ep-summary-json artifacts/tablejoin.json
pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis_tablereformat -v \
--ep-print-summary --ep-summary-json artifacts/tablereformat.json
```
## Notes
* Uses `datasets` to pull `livebench/data_analysis` at import time.
* Scoring is intentionally lightweight and aims for compatibility with LiveBench behavior (e.g., tolerant parsing, suffix matches, and defensive fallbacks), not an official reproduction.
# Image Multi-Turn Eval with Per-Step Rewards (Lunar Lander)
Source: https://evalprotocol.io/example/lunar-lander
If you haven't read through [Multi-turn eval (per-step rewards)](/tutorial/multi-turn-eval-per-step-rewards) yet, we recommend checking that out first as this tutorial builds on those foundational concepts.
This tutorial demonstrates how to create **multimodal** multi-turn reinforcement learning evaluations with visual observations and per-step rewards using the classic Lunar Lander environment. Unlike text-based RL environments like Frozen Lake, this example showcases how agents can process **both visual input (rendered game frames) and numerical state data** while receiving detailed per-step reward signals for landing performance, fuel efficiency, and trajectory optimization.
You can find the complete code for this example at [test\_lunar\_lander.py](https://github.com/eval-protocol/python-sdk/blob/main/tests/pytest/test_lunar_lander.py).
## Understanding the Lunar Lander Environment
Lunar Lander is a classic physics-based RL environment where an agent controls a spacecraft landing on the moon, requiring both visual understanding and precise control.
* **Action Space**: `Discrete(4)` - NOTHING (0), FIRE\_LEFT (1), FIRE\_MAIN (2), FIRE\_RIGHT (3)
* **Observation Space**: `Box(8)` - \[x, y, velocity\_x, velocity\_y, angle, angular\_velocity, leg1\_contact, leg2\_contact]
* **Visual Component**: 400x600 RGB rendered frames showing the lander, moon surface, and landing flags

**Complex Reward Structure**: Unlike Frozen Lake's sparse binary rewards, Lunar Lander provides detailed per-step feedback:
* Distance to landing pad (closer = better)
* Velocity penalties (slower = better)
* Angle penalties (more horizontal = better)
* +10 points per leg touching ground
* Fuel consumption penalties (-0.03 for side engines, -0.3 for main engine)
* Final outcome: +100 for successful landing, -100 for crash
**Success Criteria**: Episodes scoring ≥200 points are considered successful landings.
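For orientation, this is roughly what the underlying Gymnasium loop and success check look like outside of the MCP harness; the random action stands in for the model's tool call (requires `gymnasium[box2d]`).
```python
import gymnasium as gym

env = gym.make("LunarLander-v3", render_mode="rgb_array")
obs, info = env.reset(seed=42)

total_reward = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # stand-in for the model-chosen NOTHING/FIRE_* action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # per-step feedback: distance, velocity, angle, fuel, leg contact

success = total_reward >= 200  # episodes scoring >= 200 count as successful landings
env.close()
```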
## Understanding the Dataset Structure
The Lunar Lander dataset demonstrates **multimodal prompting** - agents must analyze both numerical state and visual information to make decisions.
### Example Dataset Entry
```json
{
"id": "multi_env_test_001",
"system_prompt": "You are controlling a lunar lander spacecraft. Use the lander_action tool with actions: NOTHING, FIRE_LEFT, FIRE_MAIN, FIRE_RIGHT. Your goal is to land safely on the moon between the two flags without crashing.",
"user_prompt_template": "Current state: {observation}. First, describe what is in the image attached and analyze the current state. You MUST explain your reasoning in picking the next best action (NOTHING, FIRE_LEFT, FIRE_MAIN, FIRE_RIGHT) and call lander_action tool with it to land the spacecraft.",
"environment_context": {
"game": "LunarLander",
"continuous": false,
"gravity": -10.0,
"enable_wind": false,
"seed": 42
}
}
```
**Key Features:**
* **Visual Analysis Required**: "describe what is in the image attached"
* **State Analysis**: Both numerical state data and visual information
* **Tool Integration**: Structured interaction through `lander_action` tool
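Conceptually, each user turn carries both the formatted numerical state and the rendered frame. The sketch below shows one way such a message could be assembled using an OpenAI-style content list; the field names are illustrative, and the actual message construction is handled by the rollout processor.
```python
from typing import Any, Dict


def build_user_turn(observation: Dict[str, Any], image_data_url: str, template: str) -> Dict[str, Any]:
    """Assemble a user message carrying both numerical state and the rendered frame (sketch)."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": template.format(observation=observation)},
            {"type": "image_url", "image_url": {"url": image_data_url}},  # base64 PNG data URL
        ],
    }
```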
## Test Harness Architecture
The architecture is similar to Frozen Lake's in the sense that we again extend `McpGym` and create an `EnvironmentAdapter`, but there are some key differences.
### MCP Server: LunarLanderMcp
The `LunarLanderMcp` class extends `McpGym` with **visual rendering capabilities** in `format_observation`:
```python
class LunarLanderMcp(McpGym):
"""LunarLander production server with visual rendering support."""
def __init__(self, seed: Optional[int] = None):
self.adapter = LunarLanderAdapter()
super().__init__("LunarLander-v3", self.adapter, seed)
def _register_tools(self):
@self.mcp.tool(
name="lander_action",
description="Control the lunar lander with discrete actions. "
"Valid actions: NOTHING, FIRE_LEFT, FIRE_MAIN, FIRE_RIGHT."
)
def lander_action(action: str, ctx: Context) -> Dict[str, Any]:
# Parse and validate action
action_int = self.adapter.parse_action(action)
# Execute step with session management
session_id = self._get_session_id(ctx)
observation_data = self._execute_session_environment_step(session_id, action_int)
return observation_data
def format_observation(self, obs: Any, env: Any) -> Dict[str, Any]:
"""Format observation with both numerical data AND visual frame."""
# Structured numerical data
formatted = self.adapter.format_observation(obs)
# Add rendered visual frame
rendered_frame = self.adapter.render_frame(env)
if rendered_frame:
formatted["image_url"] = {
"url": rendered_frame # Base64 encoded PNG
}
return formatted
```
### Environment Adapter: LunarLanderAdapter
The `LunarLanderAdapter` acts as an adapter to the Gymnasium library's implementation of the LunarLander game, which includes both the **physics simulation and visual rendering**:
```python
class LunarLanderAdapter(EnvironmentAdapter):
"""LunarLander adapter with multimodal observation support."""
def __init__(self):
self.action_map = {
"NOTHING": 0, "FIRE_LEFT": 1,
"FIRE_MAIN": 2, "FIRE_RIGHT": 3
}
def format_observation(self, obs: np.ndarray) -> Dict[str, Any]:
"""Convert 8D observation vector to structured data."""
return {
"position": {"x": float(obs[0]), "y": float(obs[1])},
"velocity": {"x": float(obs[2]), "y": float(obs[3])},
"orientation": {"angle": float(obs[4]), "angular_velocity": float(obs[5])},
"legs": {"left_contact": bool(obs[6]), "right_contact": bool(obs[7])},
}
def render_frame(self, env: LunarLander) -> Optional[str]:
"""Render visual frame as base64 encoded image."""
rgb_array = env.render()
if rgb_array is None:
return None
# Convert to PIL Image and encode as base64
image = Image.fromarray(rgb_array.astype(np.uint8))
buffer = io.BytesIO()
image.save(buffer, format="PNG")
return f"data:image/png;base64,{base64.b64encode(buffer.getvalue()).decode('utf-8')}"
```
## Pytest Implementation
### Step 1: Dataset Adapter
```python
def lunar_lander_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
"""Convert lunar lander entries to EvaluationRow objects."""
rows = []
for row in data:
eval_row = EvaluationRow(
messages=[Message(role="system", content=row["system_prompt"])],
input_metadata=InputMetadata(
row_id=row["id"],
dataset_info={
"environment_context": row["environment_context"],
"user_prompt_template": row["user_prompt_template"],
}
)
)
rows.append(eval_row)
return rows
```
### Step 2: Test Configuration
```python
@evaluation_test(
input_dataset=["tests/pytest/data/lunar_lander_dataset.jsonl"],
dataset_adapter=lunar_lander_to_evaluation_row,
completion_params=[{"model": "gpt-4.1", "temperature": 0.0, "max_tokens": 4096}], # Vision-capable model required
rollout_processor=MCPGymRolloutProcessor(),
passed_threshold=0.0,
num_runs=1,
mode="pointwise",
server_script_path="examples/lunar_lander_mcp/server.py",
steps=15,
)
```
**Key Configuration Notes:**
* **Vision Model Required**: `gpt-4.1` or other vision-capable models
* **Same Rollout Processor**: Reuses `MCPGymRolloutProcessor` from Frozen Lake, demonstrating framework generalization across text and visual environments
* **Episode Management**: `steps=15` is not enough for a Lunar Lander episode to complete; a full landing would likely take hundreds of steps.
### Step 3: Trajectory Evaluation
As defined by the game, an episode is successful if it scores 200 points or more; this is converted to 1.0 or 0.0 to signify pass or fail in our pytest setup.
```python
def test_lunar_lander_evaluation(row: EvaluationRow) -> EvaluationRow:
"""Evaluate lunar lander performance using physics-based scoring."""
# Get cumulative reward from entire visual trajectory
score = row.get_total_reward()
# Apply Lunar Lander success criterion
evaluation_score = 1.0 if score >= 200 else 0.0
reason = (f"✅ Successful landing with reward {score:.2f}" if score >= 200
else f"❌ Failed landing with reward {score:.2f}")
row.evaluation_result = EvaluateResult(
score=evaluation_score,
reason=reason,
)
return row
```
## Conclusion
This Lunar Lander tutorial showcases eval-protocol's **multimodal evaluation capabilities**, demonstrating how the framework seamlessly handles complex visual RL environments while maintaining the same architectural patterns established with text-based evaluations. The key innovation is the **dual-stream observation system**: agents receive both structured numerical data and visual frames, enabling sophisticated multimodal reasoning about physics, control, and spatial relationships.
The per-step reward structure in Lunar Lander is particularly valuable for training data generation. Unlike Frozen Lake's sparse rewards, every frame provides rich feedback about landing performance, fuel efficiency, and trajectory optimization. This creates **dense multimodal training signals** that can inform visual RL algorithms, multimodal fine-tuning approaches, and hybrid training systems that combine visual understanding with control policy learning. In the future, we hope to extend this work to frontier LLM use-cases like browser-use agents.
Most importantly, this example demonstrates eval-protocol's **modality-agnostic design**. The same `MCPGymRolloutProcessor`, pytest patterns, and evaluation infrastructure work seamlessly across text-based grid worlds and complex visual physics simulations. This unified approach enables practitioners to build comprehensive evaluation suites spanning the full spectrum of AI capabilities, from language understanding to visual reasoning to real-time control, all within a single, consistent framework.
# SVG Generation Evaluation
Source: https://evalprotocol.io/example/svg-generation
Evaluate AI models' ability to generate SVG code that meets specific visual requirements using automated rendering and LLM judge scoring
This example demonstrates how to create comprehensive SVG generation evaluations using the Eval Protocol (EP) framework. The evaluation combines automated SVG rendering with LLM judge assessment to measure how well models can generate visual content that meets specific requirements.
You can find the complete code for this example at [test\_svgbench.py](https://github.com/eval-protocol/python-sdk/blob/main/tests/pytest/test_svgbench.py).
## Understanding SVG Generation Evaluation
SVG generation evaluation assesses a model's ability to:
* **Interpret visual requirements**: Understand textual descriptions of visual elements
* **Generate valid SVG code**: Create syntactically correct SVG markup
* **Meet specific criteria**: Fulfill detailed visual requirements like colors, shapes, positions
* **Follow formatting conventions**: Use proper SVG code block formatting
Unlike traditional text evaluations, SVG generation testing evaluates **visual creativity and technical precision** - essential capabilities for AI systems that need to create graphical content, diagrams, icons, and visual representations.
## Understanding the Dataset Structure
The SVG generation dataset contains diverse test cases that evaluate different aspects of visual content creation, from simple geometric shapes to complex multi-element compositions.
### Dataset Format
Each entry in the dataset contains:
* **`id`**: Unique identifier for the test case
* **`prompt`**: Base textual description of what to create
* **`requirements`**: List of specific visual criteria that must be met
* **`total_requirements`**: Number of requirements for scoring normalization
### Example Dataset Entry
**Complex UI Recreation - Google Homepage:**
```json
{
"id": "google_homepage",
"prompt": "Write `svg` code for a screenshot of the [Google homepage](https://google.com).",
"requirements": [
"The overall background of the SVG must be white",
"All primary elements must be horizontally centered on the canvas",
"Include the Google logo in the center, using its official multi-color scheme (blue, red, yellow, blue, green, red)",
"Place a prominent search bar directly below the Google logo",
"The search bar must be a rounded rectangle with a light gray border",
"The search bar must contain a gray magnifying glass icon on the left side",
"The search bar must contain a gray microphone icon on the right side",
"Place two distinct buttons below the search bar",
"The left button must be labeled 'Google Search'",
"The right button must be labeled 'I'm Feeling Lucky'",
"Buttons should have a light gray background, a thin border, and dark gray text",
"Create a header section at the top right of the canvas",
"The header must include text links for 'Gmail' and 'Images'",
"The header must include a 3x3 grid icon (Google Apps launcher)",
"The header must include a prominent 'Sign in' button, typically with a blue background and white text"
]
}
```
### Dataset Characteristics
**Requirement Categories**:
* **Structural**: Presence of specific shapes, elements, or text
* **Aesthetic**: Colors, proportions, visual balance, style consistency
* **Technical**: SVG formatting, dimensions, code validity
* **Functional**: Scalability, accessibility, professional appearance
**Evaluation Approach**:
* **Automated rendering**: SVG to PNG conversion using Selenium WebDriver
* **LLM judge scoring**: GPT-4.1 vision model evaluates requirement fulfillment
* **Ratio-based scoring**: Score = fulfilled\_requirements / total\_requirements
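The final scoring step is a simple ratio, sketched below. The judge decides which requirements were fulfilled from the rendered PNG; the helper name and inputs here are illustrative.
```python
from typing import List


def ratio_score(fulfilled: List[bool], total_requirements: int) -> float:
    """Score = fulfilled_requirements / total_requirements, clamped to [0, 1] (sketch)."""
    if total_requirements <= 0:
        return 0.0
    return min(sum(1 for ok in fulfilled if ok) / total_requirements, 1.0)


# Example: 12 of 15 Google-homepage requirements met -> 0.8
```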
## Step 1: Import Required Dependencies
First, we import the necessary modules for SVG evaluation:
```python
import base64
import json
import logging
import os
import re
import tempfile
from typing import Any, Dict, List, Optional
import litellm
from pydantic import BaseModel
from eval_protocol.models import EvaluateResult, EvaluationRow, InputMetadata, Message
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor
```
Key dependencies:
* `base64`: For encoding rendered images for LLM judge evaluation
* `litellm`: For calling the GPT-4.1 vision model as LLM judge
* `selenium`: For automated SVG to PNG rendering (imported conditionally)
* `pydantic`: For structured response validation from LLM judge
* Standard EP framework components for evaluation structure
## Step 2: Create the Dataset Adapter
We need to convert the SVG dataset format to the EP's expected format:
```python
def svgbench_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
"""
Convert SVGBench dataset entries to EvaluationRow objects.
This adapter formats the visual requirements as a numbered list and creates
a proper generation prompt that includes formatting instructions and
specific requirements for the SVG generation task.
Args:
data: List of dictionaries containing prompt and requirements
Returns:
List of EvaluationRow objects ready for evaluation
"""
rows = []
for row in data:
# Format requirements as numbered list
requirements = "\n".join([f"{i+1}. {req}" for i, req in enumerate(row["requirements"])])
# Create the generation prompt following SVGBench format
prompt = f"""{row['prompt']} Wrap the SVG code in an SVG code block following the example below.
Requirements:
{requirements}"""
eval_row = EvaluationRow(
messages=[Message(role="user", content=prompt)],
input_metadata=InputMetadata(
row_id=row["id"],
dataset_info={
"original_prompt": row["prompt"],
"requirements": row["requirements"],
"total_requirements": len(row["requirements"]),
"formatted_prompt": prompt,
},
),
)
rows.append(eval_row)
return rows
```
This adapter:
* Formats visual requirements as a clear numbered list
* Provides SVG code block formatting instructions with examples
* Preserves original prompt and requirements for evaluation reference
* Creates structured metadata for scoring calculations
## Step 3: Implement SVG Code Extraction
Extract SVG code from model responses with robust parsing:
````python
def extract_svg_code(text: str) -> Optional[str]:
"""
Extract SVG code from model response using multiple fallback strategies.
This function handles various ways models might format SVG code:
- Standard ```svg code blocks
    - Raw <svg> tags in text
- Mixed formatting approaches
Args:
text: Raw model response text
Returns:
Extracted SVG code or None if not found
"""
# First try: Look for ```svg code blocks
if "```svg" in text:
svg_parts = text.split("```svg")
if len(svg_parts) > 1:
svg_code = svg_parts[1].split("```")[0].strip()
return svg_code
    # Second try: Look for raw <svg> tags
    if "<svg" in text: