EvaluationRows into completed rows (e.g., by calling a model once, running a tool-using agent loop, or interacting with an MCP “gym”). They all implement the same Python interface:
eval_protocol/pytest/types.py as RolloutProcessorConfig and includes the most common knobs for evaluation runs. The interface lives in eval_protocol/pytest/rollout_processor.py. The evaluation framework awaits these tasks with per-row retries and finally calls cleanup() for you.
Config: RolloutProcessorConfig
- completion_params: model and generation parameters (provider-agnostic via LiteLLM). Must include
model. - mcp_config_path: path to an MCP client configuration file (used by agent/tool processors).
- server_script_path: path to an MCP server script (used by gym-like processors).
- max_concurrent_rollouts: maximum number of rows processed in parallel (default 8).
- steps: maximum rollout steps for multi-turn processors (default 30).
- logger:
DatasetLoggerto capture mid-rollout logs. - kwargs: extra, processor-specific options.
- exception_handler_config: controls automatic backoff/retry for rollout errors. See ExceptionHandlerConfig in the
@evaluation_testreference.
--ep-reasoning-effort or --ep-input-param.
Built-in processors
NoOpRolloutProcessor
- What it does: Pass-through. Returns tasks that immediately resolve to the same rows, so you can handle rollout yourself inside the evaluation function.
- When to use: You already have model outputs precomputed or you want to implement rollout logic in the test body.
- Module:
eval_protocol/pytest/default_no_op_rollout_processor.py
@evaluation_test:
SingleTurnRolloutProcessor
- What it does: Issues a single LiteLLM
completionper row and appends the assistant message (and any tool_calls) torow.messages. - When to use: Single-turn prompts, static QA, or benchmarks that only need the model’s immediate reply.
- Respects:
completion_paramsand forwardsreasoning_effortunderextra_bodywhen present. - Module:
eval_protocol/pytest/default_single_turn_rollout_process.py
AgentRolloutProcessor
- What it does: Runs a simple multi-turn agent that can call MCP tools. The agent:
- Calls the model with current
messagesand available tools. - Executes any returned tool calls in parallel.
- Appends tool results then calls the model again, until there are no more tool calls.
- Calls the model with current
- When to use: Tool-augmented tasks, function-calling, or scenarios requiring iterative reasoning via tools.
- Requires:
mcp_config_pathto enumerate available tools viaMCPMultiClient. - Honors:
max_concurrent_rolloutsfor dataset-level parallelism; tool calls within a single row are also executed in parallel. - Module:
eval_protocol/pytest/default_agent_rollout_processor.py
PydanticAgentRolloutProcessor
- What it does: Runs Pydantic AI agents with automatic message format conversion between eval-protocol and Pydantic AI formats.
- When to use: ONLY for Pydantic AI framework. Multi-turn conversations, tool usage scenarios, and complex agent workflows.
- Requires:
agent_factoryparameter - a callable that creates a Pydantic AIAgentinstance fromRolloutProcessorConfig. - Module:
eval_protocol/pytest/default_pydantic_ai_rollout_processor.py - See also: Pydantic AI integration guide for detailed examples and agent factory patterns.
| Eval-Protocol Role | Pydantic AI Conversion |
|---|---|
user | UserPromptPart |
system | SystemPromptPart |
assistant | ChatCompletion |
tool | ToolReturnPart |
The examples assume a
setup_agent function exists that creates and configures your Pydantic AI agent.MCPGymRolloutProcessor
- What it does: Spins up an MCP server (e.g., tau-bench style), creates environments, and runs rollouts through
eval_protocol.rollout(...). - When to use: Interactive environments or “gym” tasks exposed over MCP.
- Requires:
server_script_pathto launch the MCP server. Bindslocalhost:9700by default. - Module:
eval_protocol/pytest/default_mcp_gym_rollout_processor.py
RemoteRolloutProcessor (HTTP)
Using a remote HTTP service to perform the rollout is advanced. See Remote Rollout Processor for more details.Pytest plugin helpers (CLI flags)
The pytest plugin ineval_protocol/pytest/plugin.py adds flags to make evaluations CI-friendly:
--ep-max-rows=N|all: limit dataset rows processed.--ep-num-runs=N: override the number of runs for evaluation_test.--ep-max-concurrent-rollouts=N: override the maximum number of concurrent rollouts.--ep-print-summary: print a concise summary line at end of each run.--ep-summary-json=PATH: write a JSON artifact for CI.--ep-input-param key=valueor--ep-input-param @params.json: ad-hoc overrides ofcompletion_params.--ep-reasoning-effort low|medium|high|none: setsextra_body.reasoning_effortvia LiteLLM.--ep-max-retry=N: set maximum retry attempts for failed rollouts.--ep-fail-on-max-retry true|false: whether to fail the entire rollout when permanent failures occur after max retries.
Choosing a processor
- Use single-turn for simple QA and classification.
- Use agent when you need tool calls or iterative reasoning.
- Use Pydantic AI agent only for Pydantic AI framework.
- Use MCP gym for interactive environments hosted as MCP servers.
- Use no-op if you want full control inside your test body.

