@evaluation_test
The @evaluation_test decorator is the core component for creating pytest-based evaluation tests in the Evaluation Protocol. It enables you to evaluate AI models by running rollouts and applying evaluation criteria to measure performance.
Key Concepts
Before diving into the API, it's important to understand the terminology used in the Evaluation Protocol:
- Invocation: A single execution of a test function that can generate 1 or more experiments
- Experiment: A group of runs for a combination of parameters (multiple experiments if num_runs > 1)
- Run: A group of rollouts (multiple run IDs if num_runs > 1)
- Rollout: The execution/process that produces a trajectory
- Trajectory: The result produced by a rollout, a list of OpenAI Chat Completion messages
- Row: Both the input and output of an evaluation (e.g., a task within a dataset)
- Dataset: A collection of rows (List[EvaluationRow])
- Eval: A rubric implemented in the test function body that produces a score from 0 to 1
Basic Usage
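At minimum, you decorate an async evaluation function and point it at some input plus completion parameters. The sketch below is illustrative rather than authoritative: the import paths and the Message/EvaluateResult helpers are assumptions based on the patterns described later on this page, and completion_params is shown as a list of dicts because the decorator parameterizes over multiple values.

```python
from eval_protocol.models import EvaluationRow, EvaluateResult, Message  # assumed import path
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor  # assumed import path


@evaluation_test(
    # One hard-coded conversation; a JSONL dataset or data loader could be used instead.
    input_messages=[[Message(role="user", content="What is 2 + 2? Reply with just the number.")]],
    # LiteLLM-compatible provider route, as required by SingleTurnRolloutProcessor.
    completion_params=[{"model": "openai/gpt-4o", "temperature": 0.0}],
    rollout_processor=SingleTurnRolloutProcessor(),
    passed_threshold=0.9,
    mode="pointwise",
)
async def test_basic_math(row: EvaluationRow) -> EvaluationRow:
    """Score 1.0 if the assistant's reply contains the expected answer."""
    reply = row.messages[-1].content or ""
    row.evaluation_result = EvaluateResult(
        score=1.0 if "4" in reply else 0.0,
        reason=f"assistant replied: {reply!r}",
    )
    return row
```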
Parameters
No single parameter is strictly required. Provide completion_params whenever your rollout processor performs model calls (e.g., SingleTurnRolloutProcessor).
completion_params
Generation parameters for the rollout. The required fields depend on the rollout processor used:
For SingleTurnRolloutProcessor and AgentRolloutProcessor:
- Must include a model field using a LiteLLM-compatible provider route (e.g., openai/gpt-4o, anthropic/claude-3-sonnet, fireworks_ai/*)
- Optional: temperature, max_tokens, extra_body, etc.
- See the LiteLLM providers list for supported prefixes and models: https://docs.litellm.ai/docs/providers
For PydanticAgentRolloutProcessor:
- Must include a model field (the canonical way to pass model names to LLM clients)
- Optional provider field (defaults to "openai" if not specified)
- Example: {"model": "accounts/fireworks/models/kimi-k2-instruct", "provider": "fireworks"}
- The agent factory uses the model field to create the appropriate Pydantic AI model
For MCPGymRolloutProcessor:
- Must include a model field using a LiteLLM-compatible provider route
- Used to create the policy for environment interaction
For NoOpRolloutProcessor:
- Can be any value (not used for actual model calls)
- Often set to {"model": "not-used-offline"} for clarity
data_loaders
Data loaders to produce evaluation rows. Preferred for reusable, parameterized inputs. Each loader may emit multiple variants; rows inherit metadata describing the loader, variant ID, and preprocessing state. Cannot be combined with input_dataset, input_messages, or input_rows. See Data Loader for details.
input_messages
Messages to send to the model. Useful when you don't have a dataset but can hard-code messages. Will be passed as "input_dataset" to the test function.
input_dataset
Paths to JSONL datasets that will be loaded using load_jsonl(). Each path can be either a local file path or an HTTP/HTTPS URL. Provide a dataset_adapter to convert the raw JSONL data to EvaluationRows.
Behavior:
- Files are loaded using load_jsonl(), which reads JSONL format (one JSON object per line)
- Supports both local file paths and HTTP/HTTPS URLs
- Robust parsing: blank or whitespace-only lines are skipped automatically to handle trailing newlines gracefully
- Error handling: detailed error messages include line numbers and row IDs when JSON parsing fails
- Timeout support: HTTP requests have a 30-second timeout
- When multiple paths are provided and combine_datasets=True (default), files are concatenated into one dataset
- When combine_datasets=False, each path is parameterized into separate test invocations
- Raw JSONL data is passed to the dataset_adapter function for conversion to EvaluationRow format
Examples:
- Local files: "path/to/dataset.jsonl"
- HTTP URLs: "http://example.com/dataset.jsonl"
- HTTPS URLs: "https://example.com/dataset.jsonl"
input_rows
Pre-constructed EvaluationRow objects to use directly. Useful when you already have messages and/or metadata prepared. Will be passed as "input_dataset" to the test function. Note: cannot be combined with data_loaders.
dataset_adapter
Function to convert the input dataset to a list of EvaluationRows. Defaults to default_dataset_adapter.
rollout_processor
Function used to perform the rollout. Defaults to NoOpRolloutProcessor().
evaluation_test_kwargs
Additional keyword arguments for the evaluation function.
rollout_processor_kwargs
Additional keyword arguments for the rollout processor.
aggregation_method
How to aggregate scores across runs. One of "mean", "max", or "min". Defaults to "mean".
preprocess_fn
Optional preprocessing function applied to rows before rollout. Use this to expand multi-turn conversations (e.g., multi_turn_assistant_to_ground_truth) or filter/transform rows. Note: when using data_loaders, pass preprocess_fn to the loader itself (e.g., DynamicDataLoader(preprocess_fn=...)). When data_loaders is provided, the decorator-level preprocess_fn is not applied, to avoid double-processing.
passed_threshold
Threshold configuration for test success. Can be a float or an EvaluationThreshold object. The success rate must be above success, and, if set, the standard error must be below standard_error.
num_runs
Number of times to repeat the rollout and evaluations. Defaults to 1.
max_dataset_rows
Limit the dataset to the first N rows.
mcp_config_path
Path to an MCP config file that follows the MCPMultiClientConfiguration schema.
max_concurrent_rollouts
Maximum number of concurrent rollouts to run in parallel. Defaults to 8.
max_concurrent_evaluations
Maximum number of concurrent evaluations to run in parallel. Defaults to 64.
server_script_path
Path to the MCP server script to run. Defaults to "examples/tau2_mcp/server.py".
steps
Number of rollout steps to execute. Defaults to 30.
mode
Evaluation mode. "pointwise" (default) applies the test function to each row individually. "groupwise" applies the test function to a group of rollout results from the same original row (for use cases such as DPO/GRPO). "all" applies the test function to the whole dataset.
combine_datasets
Whether to combine multiple datasets. Defaults to True.
logger
DatasetLogger to use for logging. If not provided, a default logger will be used.
exception_handler_config
Configuration for exception handling and backoff retry logic. If not provided, a default configuration is used with common retryable exceptions. See the ExceptionHandlerConfig section below for detailed configuration options.
ExceptionHandlerConfig
The ExceptionHandlerConfig passed via the exception_handler_config parameter allows you to customize exception handling and retry logic for your evaluation tests. The class is defined in eval_protocol/pytest/exception_config.ExceptionHandlerConfig.
Key Features
- Retryable Exceptions: Configure which exceptions should trigger retry attempts
- Backoff Strategies: Choose between exponential or constant backoff with configurable delays
- Environment Variable Overrides: Automatically respect EP_MAX_RETRY and EP_FAIL_ON_MAX_RETRY settings
- Custom Giveup Logic: Define custom conditions for when to stop retrying
Configuration Classes
ExceptionHandlerConfig
The main configuration class that controls exception handling behavior.
BackoffConfig
Controls the retry backoff behavior.
Default Configuration
By default, the following exceptions are considered retryable:
- Standard library exceptions: ConnectionError, TimeoutError, OSError
- Requests library exceptions: requests.exceptions.ConnectionError, requests.exceptions.Timeout, requests.exceptions.HTTPError, requests.exceptions.RequestException
- HTTPX library exceptions: httpx.ConnectError, httpx.TimeoutException, httpx.NetworkError, httpx.RemoteProtocolError
Backoff Strategies
Exponential Backoff (Default)
- Starts with base_delay and multiplies by factor on each retry
- Good for transient failures that may resolve quickly
- Example: 1s → 2s → 4s → 8s → 16s (capped at max_delay)
Constant Backoff
- Uses the same delay (base_delay) for all retries
- Good for predictable, consistent retry timing
- Example: 2s → 2s → 2s → 2s
Environment Variable Integration
The configuration automatically respects these environment variables:
- EP_MAX_RETRY: Overrides max_tries in BackoffConfig
- EP_FAIL_ON_MAX_RETRY: Controls raise_on_giveup behavior
Example Usage
Basic Custom Configuration
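A hedged sketch of a custom configuration follows. The module path comes from the section above, but the exact field names on ExceptionHandlerConfig (shown here as retryable_exceptions and backoff) are assumptions, so check the class definition in eval_protocol/pytest/exception_config before copying.

```python
import requests

# Module path named earlier on this page; field names below are assumptions.
from eval_protocol.pytest.exception_config import BackoffConfig, ExceptionHandlerConfig

# Retry connection/timeout failures with exponential backoff: 1s, 2s, 4s, ... capped at 30s.
custom_retry_config = ExceptionHandlerConfig(
    retryable_exceptions=(  # assumed field name
        ConnectionError,
        TimeoutError,
        requests.exceptions.RequestException,
    ),
    backoff=BackoffConfig(  # assumed field name
        max_tries=5,
        base_delay=1.0,
        factor=2.0,
        max_delay=30.0,
    ),
)
```

The resulting object is then passed to the decorator via exception_handler_config=custom_retry_config.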
Aggressive Retry Strategy
Conservative Retry Strategy
Custom Exception Handling
Evaluation Modes
Pointwise Mode (Default)
In pointwise mode, your test function processes each row individually, enabling pipelined evaluation:
- Function must have a parameter named row of type EvaluationRow
- Function must return EvaluationRow
Groupwise Mode
In groupwise mode, your test function processes groups of rollout results from the same original row, which is useful for comparing different models or parameters (see the sketch below):
- Function must have a parameter named rows of type List[EvaluationRow]
- Function must return List[EvaluationRow]
- Must provide at least 2 completion parameters
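For example, the following hedged sketch compares two models on the same prompt; imports mirror the Basic Usage sketch and remain assumptions.

```python
from typing import List

from eval_protocol.models import EvaluationRow, EvaluateResult, Message  # assumed import path
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor  # assumed import path


@evaluation_test(
    input_messages=[[Message(role="user", content="Name the capital of France.")]],
    # Groupwise mode needs at least two completion parameter sets to compare.
    completion_params=[
        {"model": "openai/gpt-4o"},
        {"model": "anthropic/claude-3-sonnet"},
    ],
    rollout_processor=SingleTurnRolloutProcessor(),
    mode="groupwise",
)
async def test_compare_models(rows: List[EvaluationRow]) -> List[EvaluationRow]:
    """Score every candidate rollout generated for the same original row."""
    for row in rows:
        reply = row.messages[-1].content or ""
        row.evaluation_result = EvaluateResult(
            score=1.0 if "Paris" in reply else 0.0,
            reason=reply[:100],
        )
    return rows
```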
All Mode
In all mode, your test function receives the entire dataset and processes all rows together:
- Function must have a parameter named rows of type List[EvaluationRow]
- Function must return List[EvaluationRow]
Threshold Configuration
You can set thresholds for test success using the passed_threshold parameter:
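Both forms are sketched below; the EvaluationThreshold import path is an assumption, while its success and standard_error fields come from the parameter description above.

```python
from eval_protocol.models import EvaluationThreshold  # assumed import path

# Float form: the aggregate success rate must be above 0.8.
simple_threshold = 0.8

# Structured form: success rate above 0.8 AND (if provided) standard error below 0.05.
strict_threshold = EvaluationThreshold(success=0.8, standard_error=0.05)

# Either value is then passed to the decorator:
# @evaluation_test(..., passed_threshold=strict_threshold)
```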
Multiple Runs and Aggregation
Set num_runs > 1 to run multiple evaluations and aggregate results:
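A hedged sketch: with num_runs=4, every row is rolled out and evaluated four times, and the per-row scores are aggregated with aggregation_method (here the default "mean"). Imports and helpers are the same assumptions as in the Basic Usage sketch.

```python
from eval_protocol.models import EvaluationRow, EvaluateResult, Message  # assumed import path
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor  # assumed import path


@evaluation_test(
    input_messages=[[Message(role="user", content="Write a haiku about tests.")]],
    completion_params=[{"model": "openai/gpt-4o", "temperature": 0.7}],
    rollout_processor=SingleTurnRolloutProcessor(),
    num_runs=4,                 # repeat rollout + evaluation four times per row
    aggregation_method="mean",  # or "max" / "min"
    passed_threshold=0.75,
)
async def test_haiku_shape(row: EvaluationRow) -> EvaluationRow:
    reply = row.messages[-1].content or ""
    row.evaluation_result = EvaluateResult(
        score=1.0 if len(reply.splitlines()) == 3 else 0.0,
        reason="expected exactly three lines",
    )
    return row
```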
Environment Variables
The decorator supports several environment variables for configuration:
- EP_MAX_DATASET_ROWS: Overrides the max_dataset_rows parameter. Applies to both datasets and input_messages (slices to the first N rows).
- EP_NUM_RUNS: Overrides the number of runs for evaluation_test.
- EP_MAX_CONCURRENT_ROLLOUTS: Overrides the maximum number of concurrent rollouts.
- EP_INPUT_PARAMS_JSON: JSON object deep-merged into completion_params. Example: {"temperature":0,"extra_body":{"reasoning":{"effort":"low"}}}.
- EP_PRINT_SUMMARY: Set to "1" to print a one-line evaluation summary to stdout.
- EP_SUMMARY_JSON: File or directory path to write a JSON summary artifact. See "Summary artifacts" for naming behavior.
- Retry-related environment variables are documented in the Retries and failure policy section.
Return Values
Your test function must return the appropriate type based on the mode:
- Pointwise mode: EvaluationRow
- Groupwise mode: List[EvaluationRow]
- All mode: List[EvaluationRow]
Each returned row should carry an evaluation result with:
- evaluation_result.score: A float between 0 and 1
- Optional evaluation_result.metrics: Additional metric scores
Dataset loading and input formats
- Data loaders (data_loaders): Preferred for reusable and parameterized inputs. Accepts one or more EvaluationDataLoader instances (e.g., DynamicDataLoader, InlineDataLoader). Each loader can emit multiple variants and apply preprocess_fn internally. Cannot be combined with input_dataset, input_messages, or input_rows.
- Datasets (input_dataset): You can pass a single path or a list of paths to JSONL files. Files are loaded using load_jsonl(), which supports both local files and HTTP/HTTPS URLs. The function reads JSONL format (one JSON object per line) with robust error handling, automatically skips blank lines, and provides detailed error messages with line numbers and row IDs. When a list is provided and combine_datasets=True (default), files are concatenated into one dataset; when combine_datasets=False, each path is parameterized into separate test invocations.
- Input messages (input_messages): Accepts either a single row as List[Message] or many rows as List[List[Message]]. When EP_MAX_DATASET_ROWS is set, the list is sliced before parameterization.
- Input rows (input_rows): Similar to input_messages; when EP_MAX_DATASET_ROWS is set, the list is sliced before parameterization.
- Dataset adapter (dataset_adapter): Receives raw JSONL rows (as loaded by load_jsonl()) and must return List[EvaluationRow]. A sketch is shown below.
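For example, a hedged dataset_adapter for a JSONL file with question and answer fields might look like this; the EvaluationRow constructor arguments (messages, ground_truth), the raw field names, and the import path are assumptions, so adapt them to your schema.

```python
from typing import Any, Dict, List

from eval_protocol.models import EvaluationRow, Message  # assumed import path


def qa_dataset_adapter(raw_rows: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """Convert raw JSONL dicts (as returned by load_jsonl()) into EvaluationRows."""
    rows: List[EvaluationRow] = []
    for raw in raw_rows:
        rows.append(
            EvaluationRow(
                messages=[Message(role="user", content=raw["question"])],  # hypothetical "question" field
                ground_truth=raw.get("answer"),  # hypothetical "answer" field and row attribute
            )
        )
    return rows
```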
Error Handling
The decorator handles errors gracefully:
- Failed rollouts are still evaluated (you can choose to give them a score of 0)
- Assertion errors are logged with status “finished”
- Other exceptions are logged with status “error”
- Summary generation failures don’t cause test failures
- For retry behavior and configuration, see ExceptionHandlerConfig and Retries and failure policy.
Row IDs and metadata
- Stable row_id values are generated for rows missing row.input_metadata.row_id, using a deterministic hash of row content. This ensures consistent IDs across processes and runs.
- EvalMetadata is created for each evaluation with: name (test function name), description (docstring), num_runs, aggregation_method, and threshold info. Its status transitions from "running" to "finished" or "error".
- The completion_params used for a row are recorded in row.input_metadata.completion_params.
Dataset combination and parameterization
- Parameter combinations are generated across data_loaders, input_dataset, completion_params, input_messages, input_rows, and evaluation_test_kwargs.
- Pytest parameter names (in order, when present): dataset_path, completion_params, input_messages, input_rows, data_loaders, evaluation_test_kwargs.
- Set combine_datasets=False to parameterize each dataset path separately, as in the sketch below. With True (the default), multiple paths are combined into a single logical dataset per invocation.
Summary artifacts
When EP_SUMMARY_JSON is set:
- If a directory or a non-.json path is provided, a file is written inside with the base name "{suite}__{model}__{mode}__runs{num_runs}.json", where suite is the test function name and model is a sanitized slug.
- If a file path is provided, that exact file is written. If an "effort" tag is detected in completion_params (e.g., reasoning_effort), a variant suffixed with __{effort} is written instead.
- The summary includes: suite, model, agg_score, num_runs, rows, an optional 95% CI (agg_ci_low, agg_ci_high) when aggregation_method="mean", and a timestamp.
Retries and failure policy
- Rollouts are retried up to EP_MAX_RETRY times using the rollout_processor_with_retry wrapper.
- Permanent failures are, by default, raised immediately to fail the test. Override with EP_FAIL_ON_MAX_RETRY=false to continue and include errored rows (you can score them as 0 in your evaluation).
- Exception handling and retry logic can be customized via exception_handler_config.
Environment Variables
The following environment variables control retry behavior:
- EP_MAX_RETRY: Maximum number of retry attempts (default: 0, meaning no retries)
- EP_FAIL_ON_MAX_RETRY: Whether to fail the test after max retries (default: "true")
Retry Implementation Details
The retry logic is implemented in the rollout_processor_with_retry function, which:
- Wraps the rollout processor with configurable backoff retry
- Handles both retryable and non-retryable exceptions
- Uses the Python backoff library for exponential/constant backoff strategies
- Processes rows concurrently while handling retries transparently
- Logs all results (success or failure) through the configured logger
Custom Retry Configuration
For advanced retry logic, you can provide a custom ExceptionHandlerConfig:
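A hedged sketch of wiring a custom configuration into a test: as with the earlier example, the field names on the config classes are assumptions, while raise_on_giveup, base_delay, and max_tries are named elsewhere on this page.

```python
import httpx

from eval_protocol.pytest.exception_config import BackoffConfig, ExceptionHandlerConfig  # module named above

# Conservative retry: short delays, few attempts, and don't fail the test if retries are exhausted.
steady_retry = ExceptionHandlerConfig(
    retryable_exceptions=(httpx.ConnectError, httpx.TimeoutException),  # assumed field name
    backoff=BackoffConfig(  # assumed field name and field placement
        base_delay=2.0,
        max_tries=3,
        raise_on_giveup=False,
    ),
)

# @evaluation_test(..., exception_handler_config=steady_retry)
```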
Rollout processors
- A rollout processor turns input rows into completed rows (e.g., by calling a model). The decorator passes a RolloutProcessorConfig containing completion_params, mcp_config_path, server_script_path, max_concurrent_rollouts, and steps.
- Built-ins include:
  - NoOpRolloutProcessor(): passes rows through unchanged (useful for offline evaluation of pre-generated outputs).
  - SingleTurnRolloutProcessor(): performs a single chat completion via LiteLLM and appends the assistant message.
  - AgentRolloutProcessor(): runs multi-turn agent loops with MCP tool calling.
  - PydanticAgentRolloutProcessor(): runs Pydantic AI agents with structured tool calling.
  - MCPGymRolloutProcessor(): runs interactive environments via MCP servers.
- All processors are wrapped with rollout_processor_with_retry for automatic retry handling.
RolloutProcessorConfig
The RolloutProcessorConfig is passed to all rollout processors and contains the configuration needed to execute rollouts. It is defined in eval_protocol/pytest/types.py.
Configuration Fields
completion_params
Model and generation parameters for the rollout. The structure and required fields depend on the rollout processor:
SingleTurnRolloutProcessor & AgentRolloutProcessor:
- Must include a model field with a LiteLLM-compatible provider route
- Supports standard LiteLLM parameters: temperature, max_tokens, extra_body, etc.
PydanticAgentRolloutProcessor:
- Must include a model field (the canonical way to pass model names)
- Optional provider field (defaults to "openai" if not specified)
- Used to create Pydantic AI model instances via the agent factory
MCPGymRolloutProcessor:
- Must include a model field for environment policy creation
- Additional parameters are passed to the policy constructor
NoOpRolloutProcessor:
- Can contain any values (not used for actual model calls)
- Often set to placeholder values for clarity
mcp_config_path
Path to an MCP client configuration file that follows the MCPMultiClientConfiguration schema. Used by agent and tool-based rollout processors to enumerate available tools and capabilities.
semaphore
Shared semaphore for unified concurrency control across all rollout processors. Controls the maximum number of concurrent rollouts that can run simultaneously.
server_script_path
Path to an MCP server script to run. Used by gym-like processors (e.g., MCPGymRolloutProcessor) to launch interactive environments. Defaults to None.
steps
Maximum number of rollout steps to execute. Used by multi-turn processors to limit the length of agent conversations. Defaults to 30.
logger
Logger to use for capturing mid-rollout logs and debugging information. Defaults to default_logger.
kwargs
Additional keyword arguments specific to the rollout processor. This is where processor-specific configuration is passed, such as:
- usage_limits for Pydantic AI agents
- agent for pre-configured agents
- Custom tool configurations
- Environment-specific settings
exception_handler_config
Configuration for exception handling and backoff retry logic. If not provided, a default configuration is used with common retryable exceptions. See the ExceptionHandlerConfig section for detailed configuration options.
Usage in Custom Rollout Processors
When implementing custom rollout processors, you can access these configuration values:
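A hedged sketch of a minimal custom processor follows. The field names come from the Configuration Fields list above; the callable signature (rows plus a RolloutProcessorConfig) is an assumption, so match whatever the built-in processors in your eval_protocol version do.

```python
from typing import List

from eval_protocol.models import EvaluationRow                  # assumed import path
from eval_protocol.pytest.types import RolloutProcessorConfig   # module named above


class EchoRolloutProcessor:
    """Toy processor that inspects its config instead of calling a model (signature is assumed)."""

    def __call__(self, rows: List[EvaluationRow], config: RolloutProcessorConfig) -> List[EvaluationRow]:
        model = config.completion_params.get("model")     # generation parameters for this combination
        step_budget = config.steps                         # multi-turn step limit (defaults to 30)
        usage_limits = config.kwargs.get("usage_limits")   # processor-specific extras, if any
        # A real processor would perform rollouts here, honoring the shared concurrency
        # semaphore on the config and appending assistant messages to each row's trajectory.
        return rows
```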
Environment Variable Integration
Several configuration values can be overridden at runtime using environment variables:
- EP_MAX_CONCURRENT_ROLLOUTS: Overrides the semaphore limit
- EP_NUM_RUNS: Affects the number of runs for evaluation_test
- EP_MAX_RETRY: Controls retry behavior via exception_handler_config
- EP_FAIL_ON_MAX_RETRY: Controls failure behavior after max retries
Processor-Specific Configuration
Different rollout processors use the kwargs field for their specific needs:
AgentRolloutProcessor
PydanticAgentRolloutProcessor
MCPGymRolloutProcessor
Direct invocation (dual-mode)
Decorated functions can be called directly in addition to running under pytest:
- Pointwise mode: await test_fn(row) or await test_fn(row=...)
- Groupwise mode: await test_fn(rows) or await test_fn(rows=[...])
- All mode: await test_fn(rows) or await test_fn(rows=[...])
When using data_loaders, direct invocation works the same way; the decorator resolves loaders into rows before calling your function. If a decorated function is called directly with row/rows arguments, those are used as-is.
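For example, a pointwise test such as the test_basic_math sketch from Basic Usage (assumed to live in the same module) can be driven from a script; the EvaluationRow/Message constructors remain assumptions.

```python
import asyncio

from eval_protocol.models import EvaluationRow, Message  # assumed import path


async def main() -> None:
    row = EvaluationRow(
        messages=[
            Message(role="user", content="What is 2 + 2? Reply with just the number."),
            Message(role="assistant", content="4"),
        ]
    )
    # The provided row is used as-is, so the pre-filled assistant message is what gets evaluated.
    result = await test_basic_math(row=row)
    print(result.evaluation_result.score)


asyncio.run(main())
```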
Examples
Basic Math Evaluation (Pointwise Mode)
Multi-Model Comparison (All Mode)
Groupwise Evaluation for Model Comparison
Pointwise Evaluation with Custom Dataset
Complete runnable example (offline, no model calls)
This example evaluates pre-generated assistant messages using the no-op rollout processor.
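A hedged sketch of the full flow: pre-generated conversations are supplied as input_rows, NoOpRolloutProcessor passes them through untouched, and the test body scores the existing assistant replies. Import paths and the EvaluationRow/Message/EvaluateResult constructors are assumptions.

```python
from eval_protocol.models import EvaluationRow, EvaluateResult, Message  # assumed import path
from eval_protocol.pytest import evaluation_test, NoOpRolloutProcessor   # assumed import path

# Pre-generated conversations: the assistant replies already exist, so no model calls are needed.
ROWS = [
    EvaluationRow(messages=[
        Message(role="user", content="What is the capital of France?"),
        Message(role="assistant", content="Paris"),
    ]),
    EvaluationRow(messages=[
        Message(role="user", content="What is 2 + 2?"),
        Message(role="assistant", content="5"),
    ]),
]


@evaluation_test(
    input_rows=ROWS,
    completion_params=[{"model": "not-used-offline"}],  # placeholder; NoOpRolloutProcessor ignores it
    rollout_processor=NoOpRolloutProcessor(),
    # One of the two pre-generated answers is wrong, so the expected success rate is 0.5.
    passed_threshold=0.4,
)
async def test_offline_answers(row: EvaluationRow) -> EvaluationRow:
    """Score pre-generated answers against simple expected substrings."""
    question = row.messages[0].content or ""
    answer = row.messages[-1].content or ""
    expected = "Paris" if "France" in question else "4"
    row.evaluation_result = EvaluateResult(
        score=1.0 if expected in answer else 0.0,
        reason=f"expected {expected!r}, got {answer!r}",
    )
    return row
```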
Complete runnable example (single-turn online via LiteLLM)
Using data_loaders with DynamicDataLoader
Requires pip install litellm and provider credentials configured.
Integration with pytest
The decorator automatically creates pytest-compatible test functions.
Programmatic Usage
Decorated functions can be called directly in addition to running under pytest. See Direct invocation (dual-mode) for patterns by mode.
Best Practices
- Clear Documentation: Always include docstrings explaining what your evaluation measures
- Error Handling: Handle edge cases gracefully and provide meaningful scores for failed rollouts
- Metric Design: Design metrics that are objective and reproducible
- Reason: Include a reason field in the evaluation_result to explain the score
- Threshold Setting: Set realistic thresholds based on your use case
- Multiple Runs: Use num_runs > 1 for more reliable results when possible
- Resource Management: Consider max_concurrent_rollouts and max_concurrent_evaluations based on your system capabilities
- Mode Selection: Choose the appropriate mode for your evaluation needs:
  - Use "pointwise" for simple per-row evaluation
  - Use "groupwise" for comparing multiple models/parameters on the same inputs
  - Use "all" for batch processing with cross-row analysis
Troubleshooting
Common Issues
- "No combinations of parameters found": Ensure you provide both completion_params and either input_dataset, input_messages, or input_rows
- "No model provided": Check that your CompletionParams includes a model field
- Signature validation errors: Ensure your function signature matches the mode requirements:
  - Pointwise mode: def func(row: EvaluationRow) -> EvaluationRow
  - Groupwise mode: def func(rows: List[EvaluationRow]) -> List[EvaluationRow]
  - All mode: def func(rows: List[EvaluationRow]) -> List[EvaluationRow]
- Return type errors: Verify you're returning the correct type based on your mode
- "In groupwise mode, you must provide at least 2 completion parameters": Groupwise mode requires multiple completion parameters to compare
Debug Tips
- Set EP_PRINT_SUMMARY=1 to see evaluation results in the console
- Use EP_SUMMARY_JSON to save detailed results to a file
- Check the generated pytest parameterization for complex setups
- Use max_dataset_rows to limit dataset size during development
- Monitor max_concurrent_rollouts and max_concurrent_evaluations for performance tuning