Rollout processors are classes that turn input EvaluationRows into completed rows (e.g., by calling a model once, running a tool-using agent loop, or interacting with an MCP “gym”). They all implement the same Python interface:
from typing import List
import asyncio
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest.types import RolloutProcessorConfig

class RolloutProcessor:
    def __call__(self, rows: List[EvaluationRow], config: RolloutProcessorConfig) -> List[asyncio.Task[EvaluationRow]]:
        ...  # return asyncio Tasks that resolve to completed rows

    def cleanup(self) -> None:
        ...  # optional; release external resources (servers, temp files)
The interface lives in eval_protocol/pytest/rollout_processor.py; the config object, RolloutProcessorConfig, is defined in eval_protocol/pytest/types.py and carries the most common knobs for evaluation runs. The evaluation framework awaits the returned tasks with per-row retries and finally calls cleanup() for you.
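For orientation, here is a minimal custom-processor sketch that wraps each row in an asyncio task and bounds concurrency with config.max_concurrent_rollouts. The run_one_rollout coroutine is hypothetical and stands in for whatever model call or agent loop your processor performs.
import asyncio
from typing import List

from eval_protocol.models import EvaluationRow
from eval_protocol.pytest.types import RolloutProcessorConfig

class MyRolloutProcessor:
    """Sketch of a custom processor: one asyncio task per row, bounded concurrency."""

    def __call__(
        self, rows: List[EvaluationRow], config: RolloutProcessorConfig
    ) -> List[asyncio.Task[EvaluationRow]]:
        semaphore = asyncio.Semaphore(config.max_concurrent_rollouts)

        async def process(row: EvaluationRow) -> EvaluationRow:
            async with semaphore:
                # Hypothetical helper: call a model/agent and append its output to row.messages.
                return await run_one_rollout(row, config.completion_params)

        # The framework is assumed to call this inside a running event loop,
        # so asyncio.create_task is safe here.
        return [asyncio.create_task(process(row)) for row in rows]

    def cleanup(self) -> None:
        # Release any external resources (servers, temp files) created during __call__.
        pass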

Config: RolloutProcessorConfig

  • completion_params: model and generation parameters (provider-agnostic via LiteLLM). Must include model.
  • mcp_config_path: path to an MCP client configuration file (used by agent/tool processors).
  • server_script_path: path to an MCP server script (used by gym-like processors).
  • max_concurrent_rollouts: maximum number of rows processed in parallel (default 8).
  • steps: maximum rollout steps for multi-turn processors (default 30).
  • logger: DatasetLogger to capture mid-rollout logs.
  • kwargs: extra, processor-specific options.
  • exception_handler_config: controls automatic backoff/retry for rollout errors. See ExceptionHandlerConfig in the @evaluation_test reference.
Tip: You can override certain input parameters at runtime with the pytest plugin flags (see below), e.g., --ep-reasoning-effort or --ep-input-param.

Built-in processors

NoOpRolloutProcessor

  • What it does: Pass-through. Returns tasks that immediately resolve to the same rows, so you can handle rollout yourself inside the evaluation function.
  • When to use: You already have model outputs precomputed or you want to implement rollout logic in the test body.
  • Module: eval_protocol/pytest/default_no_op_rollout_processor.py
Usage with @evaluation_test:
from eval_protocol.pytest import evaluation_test, NoOpRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o-mini"}],
    rollout_processor=NoOpRolloutProcessor(),
)
def my_eval(rows):
    # rows are unchanged; compute scores here
    return rows
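As a hedged sketch of handling the rollout yourself with the no-op processor, the test body below calls LiteLLM directly and appends the reply to each row; it assumes Message is importable from eval_protocol.models, and the model name is illustrative.
import litellm

from eval_protocol.models import Message
from eval_protocol.pytest import evaluation_test, NoOpRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o-mini"}],
    rollout_processor=NoOpRolloutProcessor(),
)
def my_manual_rollout_eval(rows):
    for row in rows:
        # Perform the rollout ourselves: one LiteLLM completion per row.
        response = litellm.completion(
            model="openai/gpt-4o-mini",
            messages=[{"role": m.role, "content": m.content} for m in row.messages],
        )
        row.messages.append(
            Message(role="assistant", content=response.choices[0].message.content)
        )
        # ...compute scores from row.messages here...
    return rows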

SingleTurnRolloutProcessor

  • What it does: Issues a single LiteLLM completion per row and appends the assistant message (and any tool_calls) to row.messages.
  • When to use: Single-turn prompts, static QA, or benchmarks that only need the model’s immediate reply.
  • Respects: completion_params and forwards reasoning_effort under extra_body when present.
  • Module: eval_protocol/pytest/default_single_turn_rollout_process.py
Usage:
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor

@evaluation_test(
    completion_params=[{
        "model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b",
        "temperature": 0.0,
        "extra_body": {"reasoning_effort": "low"},  # forwarded to providers that support it
    }],
    rollout_processor=SingleTurnRolloutProcessor(),
)
def single_turn_eval(rows):
    # each row now contains the assistant's reply; compute scores
    return rows
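If it helps, here is a hedged sketch of scoring the appended reply. It assumes EvaluateResult lives in eval_protocol.models and that EvaluationRow exposes an evaluation_result field, which may differ by eval-protocol version; the scoring rule and expected answer are placeholders.
from eval_protocol.models import EvaluateResult

def single_turn_eval(rows):
    for row in rows:
        reply = row.messages[-1]  # assistant message appended by the processor
        text = (reply.content or "").lower()
        # Placeholder check; replace with your benchmark's scoring logic.
        row.evaluation_result = EvaluateResult(
            score=1.0 if "expected answer" in text else 0.0,
            reason="substring check on the assistant reply",
        )
    return rows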

AgentRolloutProcessor

  • What it does: Runs a simple multi-turn agent that can call MCP tools. The agent:
    • Calls the model with current messages and available tools.
    • Executes any returned tool calls in parallel.
    • Appends tool results then calls the model again, until there are no more tool calls.
  • When to use: Tool-augmented tasks, function-calling, or scenarios requiring iterative reasoning via tools.
  • Requires: mcp_config_path to enumerate available tools via MCPMultiClient.
  • Honors: max_concurrent_rollouts for dataset-level parallelism; tool calls within a single row are also executed in parallel.
  • Module: eval_protocol/pytest/default_agent_rollout_processor.py
Usage:
from eval_protocol.pytest import evaluation_test, AgentRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o"}],
    rollout_processor=AgentRolloutProcessor(),
    mcp_config_path="./path/to/mcp.config.json",
    max_concurrent_rollouts=8,
    steps=30,  # upper bound; the agent stops earlier if no tools are requested
)
def agent_eval(rows):
    return rows
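For intuition, here is a simplified sketch of the per-row loop described above (not the actual module code). It uses LiteLLM's async client; execute_tool is a hypothetical stand-in for dispatching a tool call through the MCP client.
import asyncio
import litellm

async def agent_loop(messages, tools, model, max_steps=30):
    for _ in range(max_steps):
        response = await litellm.acompletion(model=model, messages=messages, tools=tools)
        assistant = response.choices[0].message
        messages.append(assistant.model_dump())  # pydantic-style dump of the assistant turn
        tool_calls = assistant.tool_calls or []
        if not tool_calls:
            break  # no tool requests: the rollout is complete
        # Execute the requested tool calls in parallel, then feed results back to the model.
        results = await asyncio.gather(*(execute_tool(tc) for tc in tool_calls))
        for tc, result in zip(tool_calls, results):
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return messages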

PydanticAgentRolloutProcessor

  • What it does: Runs Pydantic AI agents with automatic message format conversion between eval-protocol and Pydantic AI formats.
  • When to use: ONLY for Pydantic AI framework. Multi-turn conversations, tool usage scenarios, and complex agent workflows.
  • Requires: agent_factory parameter - a callable that creates a Pydantic AI Agent instance from RolloutProcessorConfig.
  • Module: eval_protocol/pytest/default_pydantic_ai_rollout_processor.py
  • See also: Pydantic AI integration guide for detailed examples and agent factory patterns.
The processor automatically converts message formats and handles concurrency control:
Message-role conversion (Eval-Protocol → Pydantic AI):
  • user → UserPromptPart
  • system → SystemPromptPart
  • assistant → ChatCompletion
  • tool → ToolReturnPart
Example agent factory:
The examples assume a setup_agent function exists that creates and configures your Pydantic AI agent.
from eval_protocol.models import EvaluationRow, Message
from eval_protocol.pytest import evaluation_test, PydanticAgentRolloutProcessor
from eval_protocol.pytest.types import RolloutProcessorConfig
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.usage import UsageLimits

def agent_factory(config: RolloutProcessorConfig) -> Agent:
    model_name = config.completion_params["model"]
    # Provider is optional - defaults to "openai" if not specified
    provider = config.completion_params.get("provider", "openai")
    model = OpenAIChatModel(model_name, provider=provider)
    return setup_agent(model)

@evaluation_test(
    input_messages=[Message(role="user", content="Hello, how are you?")],
    completion_params=[{
        "model": "accounts/fireworks/models/gpt-oss-120b",
        "provider": "fireworks"  # Optional: defaults to "openai"
    }],
    rollout_processor=PydanticAgentRolloutProcessor(
        agent_factory=agent_factory,
        usage_limits=UsageLimits(total_tokens_limit=1000)
    ),
    mode="pointwise"
)
def test_pydantic_agent(row: EvaluationRow) -> EvaluationRow:
    return row
Multi-agent scenario:
from eval_protocol.models import EvaluationRow, Message
from eval_protocol.pytest import evaluation_test, PydanticAgentRolloutProcessor
from eval_protocol.pytest.types import RolloutProcessorConfig
from pydantic_ai import Agent, RunContext
from pydantic_ai.models import Model
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.usage import UsageLimits
import pytest

def setup_agent(joke_generation_model: Model, joke_selection_model: Model) -> Agent:
    """Setup multi-agent system with joke generation and selection."""
    joke_selection_agent = Agent(
        model=joke_selection_model,
        system_prompt="Use the `joke_factory` to generate some jokes, then choose the best. You must return just a single joke."
    )
    joke_generation_agent = Agent(joke_generation_model, output_type=list[str])

    @joke_selection_agent.tool
    async def joke_factory(ctx: RunContext[None], count: int) -> list[str]:
        r = await joke_generation_agent.run(
            f"Please generate {count} jokes.",
            usage=ctx.usage,
        )
        return r.output

    return joke_selection_agent

def agent_factory(config: RolloutProcessorConfig) -> Agent:
    joke_generation_model = OpenAIChatModel(
        config.completion_params["model"]["joke_generation_model"], provider="fireworks"
    )
    joke_selection_model = OpenAIChatModel(
        config.completion_params["model"]["joke_selection_model"], provider="fireworks"
    )
    return setup_agent(joke_generation_model, joke_selection_model)

@pytest.mark.asyncio
@evaluation_test(
    input_messages=[[[Message(role="user", content="Tell me a joke.")]]],
    completion_params=[{
        "model": {
            "joke_generation_model": "accounts/fireworks/models/kimi-k2-instruct",
            "joke_selection_model": "accounts/fireworks/models/deepseek-v3p1"
        }
    }],
    rollout_processor=PydanticAgentRolloutProcessor(
        agent_factory=agent_factory,
        usage_limits=UsageLimits(request_limit=5, total_tokens_limit=1000)
    ),
    mode="pointwise"
)
async def test_pydantic_multi_agent(row: EvaluationRow) -> EvaluationRow:
    return row

MCPGymRolloutProcessor

  • What it does: Spins up an MCP server (e.g., tau-bench style), creates environments, and runs rollouts through eval_protocol.rollout(...).
  • When to use: Interactive environments or “gym” tasks exposed over MCP.
  • Requires: server_script_path to launch the MCP server. Binds localhost:9700 by default.
  • Module: eval_protocol/pytest/default_mcp_gym_rollout_processor.py
Usage:
from eval_protocol.pytest import evaluation_test, MCPGymRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o"}],
    rollout_processor=MCPGymRolloutProcessor(),
    server_script_path="examples/tau2_mcp/server.py",
    steps=30,
)
def gym_eval(rows):
    return rows

RemoteRolloutProcessor (HTTP)

Using a remote HTTP service to perform rollouts is an advanced setup; see Remote Rollout Processor for details.

Pytest plugin helpers (CLI flags)

The pytest plugin in eval_protocol/pytest/plugin.py adds flags to make evaluations CI-friendly:
  • --ep-max-rows=N|all: limit dataset rows processed.
  • --ep-num-runs=N: override the number of runs for evaluation_test.
  • --ep-max-concurrent-rollouts=N: override the maximum number of concurrent rollouts.
  • --ep-print-summary: print a concise summary line at the end of each run.
  • --ep-summary-json=PATH: write a JSON artifact for CI.
  • --ep-input-param key=value or --ep-input-param @params.json: ad-hoc overrides of completion_params.
  • --ep-reasoning-effort low|medium|high|none: sets extra_body.reasoning_effort via LiteLLM.
  • --ep-max-retry=N: set maximum retry attempts for failed rollouts.
  • --ep-fail-on-max-retry true|false: whether to fail the entire rollout when permanent failures occur after max retries.
Example:
pytest -k my_eval --ep-print-summary --ep-summary-json artifacts/my_eval.json --ep-max-rows 50 --ep-max-concurrent-rollouts 16
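For ad-hoc overrides of completion_params (the values below are illustrative):
pytest -k my_eval --ep-input-param temperature=0.2 --ep-reasoning-effort low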

Choosing a processor

  • Use single-turn for simple QA and classification.
  • Use agent when you need tool calls or iterative reasoning.
  • Use Pydantic AI agent only for Pydantic AI framework.
  • Use MCP gym for interactive environments hosted as MCP servers.
  • Use no-op if you want full control inside your test body.
All processors stream results as they complete with bounded concurrency, so large datasets can run efficiently.