> ## Documentation Index
> Fetch the complete documentation index at: https://evalprotocol.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Rollout Processors

> Overview of built-in rollout processors, their configs, and when to use each

Rollout processors are classes that implement a common interface to turn input `EvaluationRow`s into completed rows (e.g., by calling a model once, running a tool-using agent loop, or interacting with an MCP "gym"). They all implement the same Python interface:

```python theme={null}
from typing import List
import asyncio
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest.types import RolloutProcessorConfig

class RolloutProcessor:
    def __call__(self, rows: List[EvaluationRow], config: RolloutProcessorConfig) -> List[asyncio.Task[EvaluationRow]]:
        ...  # return asyncio Tasks that resolve to completed rows

    def cleanup(self) -> None:
        ...  # optional; release external resources (servers, temp files)
```

The config object is defined in [`eval_protocol/pytest/types.py`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/pytest/types.py) as `RolloutProcessorConfig` and includes the most common knobs for evaluation runs. The interface lives in [`eval_protocol/pytest/rollout_processor.py`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/pytest/rollout_processor.py). The evaluation framework awaits these tasks with per-row retries and finally calls `cleanup()` for you.

## Config: RolloutProcessorConfig

* **completion\_params**: model and generation parameters (provider-agnostic via LiteLLM). Must include `model`.
* **mcp\_config\_path**: path to an MCP client configuration file (used by agent/tool processors).
* **server\_script\_path**: path to an MCP server script (used by gym-like processors).
* **max\_concurrent\_rollouts**: maximum number of rows processed in parallel (default 8).
* **steps**: maximum rollout steps for multi-turn processors (default 30).
* **logger**: `DatasetLogger` to capture mid-rollout logs.
* **kwargs**: extra, processor-specific options.
* **exception\_handler\_config**: controls automatic backoff/retry for rollout errors. See ExceptionHandlerConfig in the `@evaluation_test` reference.

Tip: You can override certain input parameters at runtime with the pytest plugin flags (see below), e.g., `--ep-reasoning-effort` or `--ep-input-param`.

## Built-in processors

### NoOpRolloutProcessor

* **What it does**: Pass-through. Returns tasks that immediately resolve to the same rows, so you can handle rollout yourself inside the evaluation function.
* **When to use**: You already have model outputs precomputed or you want to implement rollout logic in the test body.
* **Module**: [`eval_protocol/pytest/default_no_op_rollout_processor.py`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/pytest/default_no_op_rollout_processor.py)

Usage with `@evaluation_test`:

```python theme={null}
from eval_protocol.pytest import evaluation_test, NoOpRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o-mini"}],
    rollout_processor=NoOpRolloutProcessor(),
)
def my_eval(rows):
    # rows are unchanged; compute scores here
    return rows
```

### SingleTurnRolloutProcessor

* **What it does**: Issues a single LiteLLM `completion` per row and appends the assistant message (and any tool\_calls) to `row.messages`.
* **When to use**: Single-turn prompts, static QA, or benchmarks that only need the model's immediate reply.
* **Respects**: `completion_params` and forwards `reasoning_effort` under `extra_body` when present.
* **Module**: [`eval_protocol/pytest/default_single_turn_rollout_process.py`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/pytest/default_single_turn_rollout_process.py)

Usage:

```python theme={null}
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor

@evaluation_test(
    completion_params=[{
        "model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b",
        "temperature": 0.0,
        "extra_body": {"reasoning_effort": "low"},  # forwarded to providers that support it
    }],
    rollout_processor=SingleTurnRolloutProcessor(),
)
def single_turn_eval(rows):
    # each row now contains the assistant's reply; compute scores
    return rows
```

### AgentRolloutProcessor

* **What it does**: Runs a simple multi-turn agent that can call MCP tools. The agent:
  * Calls the model with current `messages` and available tools.
  * Executes any returned tool calls in parallel.
  * Appends tool results then calls the model again, until there are no more tool calls.
* **When to use**: Tool-augmented tasks, function-calling, or scenarios requiring iterative reasoning via tools.
* **Requires**: `mcp_config_path` to enumerate available tools via `MCPMultiClient`.
* **Honors**: `max_concurrent_rollouts` for dataset-level parallelism; tool calls within a single row are also executed in parallel.
* **Module**: [`eval_protocol/pytest/default_agent_rollout_processor.py`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/pytest/default_agent_rollout_processor.py)

Usage:

```python theme={null}
from eval_protocol.pytest import evaluation_test, AgentRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o"}],
    rollout_processor=AgentRolloutProcessor(),
    mcp_config_path="./path/to/mcp.config.json",
    max_concurrent_rollouts=8,
    steps=30,  # upper bound; the agent stops earlier if no tools are requested
)
def agent_eval(rows):
    return rows
```

### PydanticAgentRolloutProcessor

* **What it does**: Runs Pydantic AI agents with automatic message format conversion between eval-protocol and Pydantic AI formats.
* **When to use**: **ONLY for Pydantic AI framework.** Multi-turn conversations, tool usage scenarios, and complex agent workflows.
* **Requires**: `agent_factory` parameter - a callable that creates a Pydantic AI `Agent` instance from `RolloutProcessorConfig`.
* **Module**: [`eval_protocol/pytest/default_pydantic_ai_rollout_processor.py`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/pytest/default_pydantic_ai_rollout_processor.py)
* **See also**: [Pydantic AI integration guide](/integrations/pydantic-ai) for detailed examples and agent factory patterns.

The processor automatically converts message formats and handles concurrency control:

| Eval-Protocol Role | Pydantic AI Conversion |
| ------------------ | ---------------------- |
| `user`             | `UserPromptPart`       |
| `system`           | `SystemPromptPart`     |
| `assistant`        | `ChatCompletion`       |
| `tool`             | `ToolReturnPart`       |

Example agent factory:

<Note>
  The examples assume a `setup_agent` function exists that creates and configures your Pydantic AI agent.
</Note>

```python theme={null}
from eval_protocol.pytest import evaluation_test, PydanticAgentRolloutProcessor
from pydantic_ai.usage import UsageLimits

def agent_factory(config: RolloutProcessorConfig) -> Agent:
    model_name = config.completion_params["model"]
    # Provider is optional - defaults to "openai" if not specified
    provider = config.completion_params.get("provider", "openai")
    model = OpenAIChatModel(model_name, provider=provider)
    return setup_agent(model)

@evaluation_test(
    input_messages=[Message(role="user", content="Hello, how are you?")],
    completion_params=[{
        "model": "accounts/fireworks/models/gpt-oss-120b",
        "provider": "fireworks"  # Optional: defaults to "openai"
    }],
    rollout_processor=PydanticAgentRolloutProcessor(
        agent_factory=agent_factory,
        usage_limits=UsageLimits(max_tokens=1000)
    ),
    mode="pointwise"
)
def test_pydantic_agent(row: EvaluationRow) -> EvaluationRow:
    return row
```

Multi-agent scenario:

```python theme={null}
from eval_protocol.pytest import evaluation_test, PydanticAgentRolloutProcessor
from pydantic_ai import Agent, RunContext
from pydantic_ai.models import Model
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.usage import UsageLimits
import pytest

def setup_agent(joke_generation_model: Model, joke_selection_model: Model) -> Agent:
    """Setup multi-agent system with joke generation and selection."""
    joke_selection_agent = Agent(
        model=joke_selection_model,
        system_prompt="Use the `joke_factory` to generate some jokes, then choose the best. You must return just a single joke."
    )
    joke_generation_agent = Agent(joke_generation_model, output_type=list[str])

    @joke_selection_agent.tool
    async def joke_factory(ctx: RunContext[None], count: int) -> list[str]:
        r = await joke_generation_agent.run(
            f"Please generate {count} jokes.",
            usage=ctx.usage,
        )
        return r.output

    return joke_selection_agent

def agent_factory(config: RolloutProcessorConfig) -> Agent:
    joke_generation_model = OpenAIChatModel(
        config.completion_params["model"]["joke_generation_model"], provider="fireworks"
    )
    joke_selection_model = OpenAIChatModel(
        config.completion_params["model"]["joke_selection_model"], provider="fireworks"
    )
    return setup_agent(joke_generation_model, joke_selection_model)

@pytest.mark.asyncio
@evaluation_test(
    input_messages=[[[Message(role="user", content="Tell me a joke.")]]],
    completion_params=[{
        "model": {
            "joke_generation_model": "accounts/fireworks/models/kimi-k2-instruct",
            "joke_selection_model": "accounts/fireworks/models/deepseek-v3p1"
        }
    }],
    rollout_processor=PydanticAgentRolloutProcessor(
        agent_factory=agent_factory,
        usage_limits=UsageLimits(request_limit=5, total_tokens_limit=1000)
    ),
    mode="pointwise"
)
async def test_pydantic_multi_agent(row: EvaluationRow) -> EvaluationRow:
    return row
```

### MCPGymRolloutProcessor

* **What it does**: Spins up an MCP server (e.g., tau-bench style), creates environments, and runs rollouts through `eval_protocol.rollout(...)`.
* **When to use**: Interactive environments or "gym" tasks exposed over MCP.
* **Requires**: `server_script_path` to launch the MCP server. Binds `localhost:9700` by default.
* **Module**: [`eval_protocol/pytest/default_mcp_gym_rollout_processor.py`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/pytest/default_mcp_gym_rollout_processor.py)

Usage:

```python theme={null}
from eval_protocol.pytest import evaluation_test, MCPGymRolloutProcessor

@evaluation_test(
    completion_params=[{"model": "openai/gpt-4o"}],
    rollout_processor=MCPGymRolloutProcessor(),
    server_script_path="examples/tau2_mcp/server.py",
    steps=30,
)
def gym_eval(rows):
    return rows
```

### RemoteRolloutProcessor (HTTP)

Using a remote HTTP service to perform the rollout is advanced. See [Remote
Rollout Processor](/tutorial/remote-rollout-processor) for more details.

## Pytest plugin helpers (CLI flags)

The pytest plugin in [`eval_protocol/pytest/plugin.py`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/pytest/plugin.py) adds flags to make evaluations CI-friendly:

* `--ep-max-rows=N|all`: limit dataset rows processed.
* `--ep-num-runs=N`: override the number of runs for evaluation\_test.
* `--ep-max-concurrent-rollouts=N`: override the maximum number of concurrent rollouts.
* `--ep-print-summary`: print a concise summary line at end of each run.
* `--ep-summary-json=PATH`: write a JSON artifact for CI.
* `--ep-input-param key=value` or `--ep-input-param @params.json`: ad-hoc overrides of `completion_params`.
* `--ep-reasoning-effort low|medium|high|none`: sets `extra_body.reasoning_effort` via LiteLLM.
* `--ep-max-retry=N`: set maximum retry attempts for failed rollouts.
* `--ep-fail-on-max-retry true|false`: whether to fail the entire rollout when permanent failures occur after max retries.

Example:

```bash theme={null}
pytest -k my_eval --ep-print-summary --ep-summary-json artifacts/my_eval.json --ep-max-rows 50 --ep-max-concurrent-rollouts 16
```

## Choosing a processor

* Use **single-turn** for simple QA and classification.
* Use **agent** when you need tool calls or iterative reasoning.
* Use **Pydantic AI agent** only for Pydantic AI framework.
* Use **MCP gym** for interactive environments hosted as MCP servers.
* Use **no-op** if you want full control inside your test body.

All processors stream results as they complete with bounded concurrency, so large datasets can run efficiently.