EvaluationRow
s into completed rows (e.g., by calling a model once, running a tool-using agent loop, or interacting with an MCP “gym”). They all implement the same Python interface:
eval_protocol/pytest/types.py
as RolloutProcessorConfig
and includes the most common knobs for evaluation runs. The interface lives in eval_protocol/pytest/rollout_processor.py
. The evaluation framework awaits these tasks with per-row retries and finally calls cleanup()
for you.
Config: RolloutProcessorConfig
- completion_params: model and generation parameters (provider-agnostic via LiteLLM). Must include
model
. - mcp_config_path: path to an MCP client configuration file (used by agent/tool processors).
- server_script_path: path to an MCP server script (used by gym-like processors).
- max_concurrent_rollouts: maximum number of rows processed in parallel (default 8).
- steps: maximum rollout steps for multi-turn processors (default 30).
- logger:
DatasetLogger
to capture mid-rollout logs. - kwargs: extra, processor-specific options.
- exception_handler_config: controls automatic backoff/retry for rollout errors. See ExceptionHandlerConfig in the
@evaluation_test
reference.
--ep-reasoning-effort
or --ep-input-param
.
Built-in processors
NoOpRolloutProcessor
- What it does: Pass-through. Returns tasks that immediately resolve to the same rows, so you can handle rollout yourself inside the evaluation function.
- When to use: You already have model outputs precomputed or you want to implement rollout logic in the test body.
- Module:
eval_protocol/pytest/default_no_op_rollout_processor.py
@evaluation_test
:
SingleTurnRolloutProcessor
- What it does: Issues a single LiteLLM
completion
per row and appends the assistant message (and any tool_calls) torow.messages
. - When to use: Single-turn prompts, static QA, or benchmarks that only need the model’s immediate reply.
- Respects:
completion_params
and forwardsreasoning_effort
underextra_body
when present. - Module:
eval_protocol/pytest/default_single_turn_rollout_process.py
AgentRolloutProcessor
- What it does: Runs a simple multi-turn agent that can call MCP tools. The agent:
- Calls the model with current
messages
and available tools. - Executes any returned tool calls in parallel.
- Appends tool results then calls the model again, until there are no more tool calls.
- Calls the model with current
- When to use: Tool-augmented tasks, function-calling, or scenarios requiring iterative reasoning via tools.
- Requires:
mcp_config_path
to enumerate available tools viaMCPMultiClient
. - Honors:
max_concurrent_rollouts
for dataset-level parallelism; tool calls within a single row are also executed in parallel. - Module:
eval_protocol/pytest/default_agent_rollout_processor.py
PydanticAgentRolloutProcessor
- What it does: Runs Pydantic AI agents with automatic message format conversion between eval-protocol and Pydantic AI formats.
- When to use: ONLY for Pydantic AI framework. Multi-turn conversations, tool usage scenarios, and complex agent workflows.
- Requires:
agent_factory
parameter - a callable that creates a Pydantic AIAgent
instance fromRolloutProcessorConfig
. - Module:
eval_protocol/pytest/default_pydantic_ai_rollout_processor.py
- See also: Pydantic AI integration guide for detailed examples and agent factory patterns.
Eval-Protocol Role | Pydantic AI Conversion |
---|---|
user | UserPromptPart |
system | SystemPromptPart |
assistant | ChatCompletion |
tool | ToolReturnPart |
The examples assume a
setup_agent
function exists that creates and configures your Pydantic AI agent.MCPGymRolloutProcessor
- What it does: Spins up an MCP server (e.g., tau-bench style), creates environments, and runs rollouts through
eval_protocol.rollout(...)
. - When to use: Interactive environments or “gym” tasks exposed over MCP.
- Requires:
server_script_path
to launch the MCP server. Bindslocalhost:9700
by default. - Module:
eval_protocol/pytest/default_mcp_gym_rollout_processor.py
RemoteRolloutProcessor (HTTP)
Using a remote HTTP service to perform the rollout is advanced. See Remote Rollout Processor for more details.Pytest plugin helpers (CLI flags)
The pytest plugin ineval_protocol/pytest/plugin.py
adds flags to make evaluations CI-friendly:
--ep-max-rows=N|all
: limit dataset rows processed.--ep-num-runs=N
: override the number of runs for evaluation_test.--ep-max-concurrent-rollouts=N
: override the maximum number of concurrent rollouts.--ep-print-summary
: print a concise summary line at end of each run.--ep-summary-json=PATH
: write a JSON artifact for CI.--ep-input-param key=value
or--ep-input-param @params.json
: ad-hoc overrides ofcompletion_params
.--ep-reasoning-effort low|medium|high|none
: setsextra_body.reasoning_effort
via LiteLLM.--ep-max-retry=N
: set maximum retry attempts for failed rollouts.--ep-fail-on-max-retry true|false
: whether to fail the entire rollout when permanent failures occur after max retries.
Choosing a processor
- Use single-turn for simple QA and classification.
- Use agent when you need tool calls or iterative reasoning.
- Use Pydantic AI agent only for Pydantic AI framework.
- Use MCP gym for interactive environments hosted as MCP servers.
- Use no-op if you want full control inside your test body.