Rollout processors are classes that turn input EvaluationRows into completed rows (e.g., by calling a model once, running a tool-using agent loop, or interacting with an MCP “gym”). They all implement the same Python interface:
from typing import List
import asyncio
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest.types import RolloutProcessorConfig

class RolloutProcessor:
    def __call__(self, rows: List[EvaluationRow], config: RolloutProcessorConfig) -> List[asyncio.Task[EvaluationRow]]:
        ...  # return asyncio Tasks that resolve to completed rows

    def cleanup(self) -> None:
        ...  # optional; release external resources (servers, temp files)
The interface lives in eval_protocol/pytest/rollout_processor.py; the config object, RolloutProcessorConfig, is defined in eval_protocol/pytest/types.py and carries the most common knobs for evaluation runs. The evaluation framework awaits the returned tasks with per-row retries and finally calls cleanup() for you.
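For orientation, here is a minimal custom-processor sketch that wraps each row in an asyncio task and bounds concurrency with config.max_concurrent_rollouts. The run_one_rollout coroutine is hypothetical and stands in for whatever model call or agent loop your processor performs.
import asyncio
from typing import List

from eval_protocol.models import EvaluationRow
from eval_protocol.pytest.types import RolloutProcessorConfig

class MyRolloutProcessor:
    """Sketch of a custom processor: one asyncio task per row, bounded concurrency."""

    def __call__(
        self, rows: List[EvaluationRow], config: RolloutProcessorConfig
    ) -> List[asyncio.Task[EvaluationRow]]:
        semaphore = asyncio.Semaphore(config.max_concurrent_rollouts)

        async def process(row: EvaluationRow) -> EvaluationRow:
            async with semaphore:
                # Hypothetical helper: call a model/agent and append its output to row.messages.
                return await run_one_rollout(row, config.completion_params)

        # The framework is assumed to call this inside a running event loop,
        # so asyncio.create_task is safe here.
        return [asyncio.create_task(process(row)) for row in rows]

    def cleanup(self) -> None:
        # Release any external resources (servers, temp files) created during __call__.
        pass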

Config: RolloutProcessorConfig

  • completion_params: model and generation parameters (provider-agnostic via LiteLLM). Must include model.
  • mcp_config_path: path to an MCP client configuration file (used by agent/tool processors).
  • server_script_path: path to an MCP server script (used by gym-like processors).
  • max_concurrent_rollouts: maximum number of rows processed in parallel (default 8).
  • steps: maximum rollout steps for multi-turn processors (default 30).
  • logger: DatasetLogger to capture mid-rollout logs.
  • kwargs: extra, processor-specific options.
  • exception_handler_config: controls automatic backoff/retry for rollout errors. See ExceptionHandlerConfig in the @evaluation_test reference.
Tip: You can override certain input parameters at runtime with the pytest plugin flags (see below), e.g., --ep-reasoning-effort or --ep-input-param.

Built-in processors

NoOpRolloutProcessor

  • What it does: Pass-through. Returns tasks that immediately resolve to the same rows, so you can handle rollout yourself inside the evaluation function.
  • When to use: You already have model outputs precomputed or you want to implement rollout logic in the test body.
  • Module: eval_protocol/pytest/default_no_op_rollout_processor.py
Usage with @evaluation_test:
from eval_protocol.pytest import evaluation_test, NoOpRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o-mini"}],
    rollout_processor=NoOpRolloutProcessor(),
)
def my_eval(rows):
    # rows are unchanged; compute scores here
    return rows
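As a hedged sketch of handling the rollout yourself with the no-op processor, the test body below calls LiteLLM directly and appends the reply to each row; it assumes Message is importable from eval_protocol.models, and the model name is illustrative.
import litellm

from eval_protocol.models import Message
from eval_protocol.pytest import evaluation_test, NoOpRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o-mini"}],
    rollout_processor=NoOpRolloutProcessor(),
)
def my_manual_rollout_eval(rows):
    for row in rows:
        # Perform the rollout ourselves: one LiteLLM completion per row.
        response = litellm.completion(
            model="openai/gpt-4o-mini",
            messages=[{"role": m.role, "content": m.content} for m in row.messages],
        )
        row.messages.append(
            Message(role="assistant", content=response.choices[0].message.content)
        )
        # ...compute scores from row.messages here...
    return rows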

SingleTurnRolloutProcessor

  • What it does: Issues a single LiteLLM completion per row and appends the assistant message (and any tool_calls) to row.messages.
  • When to use: Single-turn prompts, static QA, or benchmarks that only need the model’s immediate reply.
  • Respects: completion_params and forwards reasoning_effort under extra_body when present.
  • Module: eval_protocol/pytest/default_single_turn_rollout_process.py
Usage:
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor

@evaluation_test(
    completion_params=[{
        "model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b",
        "temperature": 0.0,
        "extra_body": {"reasoning_effort": "low"},  # forwarded to providers that support it
    }],
    rollout_processor=SingleTurnRolloutProcessor(),
)
def single_turn_eval(rows):
    # each row now contains the assistant's reply; compute scores
    return rows
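If it helps, here is a hedged sketch of scoring the appended reply. It assumes EvaluateResult lives in eval_protocol.models and that EvaluationRow exposes an evaluation_result field, which may differ by eval-protocol version; the scoring rule and expected answer are placeholders.
from eval_protocol.models import EvaluateResult

def single_turn_eval(rows):
    for row in rows:
        reply = row.messages[-1]  # assistant message appended by the processor
        text = (reply.content or "").lower()
        # Placeholder check; replace with your benchmark's scoring logic.
        row.evaluation_result = EvaluateResult(
            score=1.0 if "expected answer" in text else 0.0,
            reason="substring check on the assistant reply",
        )
    return rows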

AgentRolloutProcessor

  • What it does: Runs a simple multi-turn agent that can call MCP tools. The agent:
    • Calls the model with current messages and available tools.
    • Executes any returned tool calls in parallel.
    • Appends tool results then calls the model again, until there are no more tool calls.
  • When to use: Tool-augmented tasks, function-calling, or scenarios requiring iterative reasoning via tools.
  • Requires: mcp_config_path to enumerate available tools via MCPMultiClient.
  • Honors: max_concurrent_rollouts for dataset-level parallelism; tool calls within a single row are also executed in parallel.
  • Module: eval_protocol/pytest/default_agent_rollout_processor.py
Usage:
from eval_protocol.pytest import evaluation_test, AgentRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o"}],
    rollout_processor=AgentRolloutProcessor(),
    mcp_config_path="./path/to/mcp.config.json",
    max_concurrent_rollouts=8,
    steps=30,  # upper bound; the agent stops earlier if no tools are requested
)
def agent_eval(rows):
    return rows
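For intuition, here is a simplified sketch of the per-row loop described above (not the actual module code). It uses LiteLLM's async client; execute_tool is a hypothetical stand-in for dispatching a tool call through the MCP client.
import asyncio
import litellm

async def agent_loop(messages, tools, model, max_steps=30):
    for _ in range(max_steps):
        response = await litellm.acompletion(model=model, messages=messages, tools=tools)
        assistant = response.choices[0].message
        messages.append(assistant.model_dump())  # pydantic-style dump of the assistant turn
        tool_calls = assistant.tool_calls or []
        if not tool_calls:
            break  # no tool requests: the rollout is complete
        # Execute the requested tool calls in parallel, then feed results back to the model.
        results = await asyncio.gather(*(execute_tool(tc) for tc in tool_calls))
        for tc, result in zip(tool_calls, results):
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return messages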

PydanticAgentRolloutProcessor

  • What it does: Runs Pydantic AI agents with automatic message format conversion between eval-protocol and Pydantic AI formats.
  • When to use: ONLY for Pydantic AI framework. Multi-turn conversations, tool usage scenarios, and complex agent workflows.
  • Requires: agent_factory parameter - a callable that creates a Pydantic AI Agent instance from RolloutProcessorConfig.
  • Module: eval_protocol/pytest/default_pydantic_ai_rollout_processor.py
  • See also: Pydantic AI integration guide for detailed examples and agent factory patterns.
The processor automatically converts message formats and handles concurrency control:
Message-role conversion (Eval-Protocol → Pydantic AI):
  • user → UserPromptPart
  • system → SystemPromptPart
  • assistant → ChatCompletion
  • tool → ToolReturnPart
Example agent factory:
The examples assume a setup_agent function exists that creates and configures your Pydantic AI agent.
from eval_protocol.models import EvaluationRow, Message
from eval_protocol.pytest import evaluation_test, PydanticAgentRolloutProcessor
from eval_protocol.pytest.types import RolloutProcessorConfig
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.usage import UsageLimits

def agent_factory(config: RolloutProcessorConfig) -> Agent:
    model_name = config.completion_params["model"]
    # Provider is optional - defaults to "openai" if not specified
    provider = config.completion_params.get("provider", "openai")
    model = OpenAIChatModel(model_name, provider=provider)
    return setup_agent(model)

@evaluation_test(
    input_messages=[Message(role="user", content="Hello, how are you?")],
    completion_params=[{
        "model": "accounts/fireworks/models/gpt-oss-120b",
        "provider": "fireworks"  # Optional: defaults to "openai"
    }],
    rollout_processor=PydanticAgentRolloutProcessor(
        agent_factory=agent_factory,
        usage_limits=UsageLimits(total_tokens_limit=1000)
    ),
    mode="pointwise"
)
def test_pydantic_agent(row: EvaluationRow) -> EvaluationRow:
    return row
Multi-agent scenario:
from eval_protocol.models import EvaluationRow, Message
from eval_protocol.pytest import evaluation_test, PydanticAgentRolloutProcessor
from eval_protocol.pytest.types import RolloutProcessorConfig
from pydantic_ai import Agent, RunContext
from pydantic_ai.models import Model
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.usage import UsageLimits
import pytest

def setup_agent(joke_generation_model: Model, joke_selection_model: Model) -> Agent:
    """Setup multi-agent system with joke generation and selection."""
    joke_selection_agent = Agent(
        model=joke_selection_model,
        system_prompt="Use the `joke_factory` to generate some jokes, then choose the best. You must return just a single joke."
    )
    joke_generation_agent = Agent(joke_generation_model, output_type=list[str])

    @joke_selection_agent.tool
    async def joke_factory(ctx: RunContext[None], count: int) -> list[str]:
        r = await joke_generation_agent.run(
            f"Please generate {count} jokes.",
            usage=ctx.usage,
        )
        return r.output

    return joke_selection_agent

def agent_factory(config: RolloutProcessorConfig) -> Agent:
    joke_generation_model = OpenAIChatModel(
        config.completion_params["model"]["joke_generation_model"], provider="fireworks"
    )
    joke_selection_model = OpenAIChatModel(
        config.completion_params["model"]["joke_selection_model"], provider="fireworks"
    )
    return setup_agent(joke_generation_model, joke_selection_model)

@pytest.mark.asyncio
@evaluation_test(
    input_messages=[[[Message(role="user", content="Tell me a joke.")]]],
    completion_params=[{
        "model": {
            "joke_generation_model": "accounts/fireworks/models/kimi-k2-instruct",
            "joke_selection_model": "accounts/fireworks/models/deepseek-v3p1"
        }
    }],
    rollout_processor=PydanticAgentRolloutProcessor(
        agent_factory=agent_factory,
        usage_limits=UsageLimits(request_limit=5, total_tokens_limit=1000)
    ),
    mode="pointwise"
)
async def test_pydantic_multi_agent(row: EvaluationRow) -> EvaluationRow:
    return row

MCPGymRolloutProcessor

  • What it does: Spins up an MCP server (e.g., tau-bench style), creates environments, and runs rollouts through eval_protocol.rollout(...).
  • When to use: Interactive environments or “gym” tasks exposed over MCP.
  • Requires: server_script_path to launch the MCP server. Binds localhost:9700 by default.
  • Module: eval_protocol/pytest/default_mcp_gym_rollout_processor.py
Usage:
from eval_protocol.pytest import evaluation_test, MCPGymRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o"}],
    rollout_processor=MCPGymRolloutProcessor(),
    server_script_path="examples/tau2_mcp/server.py",
    steps=30,
)
def gym_eval(rows):
    return rows

RemoteRolloutProcessor (HTTP)

Using a remote HTTP service to perform rollouts is an advanced setup; see Remote Rollout Processor for details.

Pytest plugin helpers (CLI flags)

The pytest plugin in eval_protocol/pytest/plugin.py adds flags to make evaluations CI-friendly:
  • --ep-max-rows=N|all: limit dataset rows processed.
  • --ep-num-runs=N: override the number of runs for evaluation_test.
  • --ep-max-concurrent-rollouts=N: override the maximum number of concurrent rollouts.
  • --ep-print-summary: print a concise summary line at the end of each run.
  • --ep-summary-json=PATH: write a JSON artifact for CI.
  • --ep-input-param key=value or --ep-input-param @params.json: ad-hoc overrides of completion_params.
  • --ep-reasoning-effort low|medium|high|none: sets extra_body.reasoning_effort via LiteLLM.
  • --ep-max-retry=N: set maximum retry attempts for failed rollouts.
  • --ep-fail-on-max-retry true|false: whether to fail the entire rollout when permanent failures occur after max retries.
Example:
pytest -k my_eval --ep-print-summary --ep-summary-json artifacts/my_eval.json --ep-max-rows 50 --ep-max-concurrent-rollouts 16
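For ad-hoc overrides of completion_params (the values below are illustrative):
pytest -k my_eval --ep-input-param temperature=0.2 --ep-reasoning-effort low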

Choosing a processor

  • Use single-turn for simple QA and classification.
  • Use agent when you need tool calls or iterative reasoning.
  • Use Pydantic AI agent only for Pydantic AI framework.
  • Use MCP gym for interactive environments hosted as MCP servers.
  • Use no-op if you want full control inside your test body.
All processors stream results as they complete with bounded concurrency, so large datasets can run efficiently.