The @evaluation_test decorator is the core component for creating pytest-based evaluation tests in Eval Protocol. It enables you to evaluate AI models by running rollouts and applying evaluation criteria to measure performance.

What is an @evaluation_test?

An @evaluation_test is a pytest test with superpowers:
  • Takes in rows (from a dataset, loader, or hard‑coded messages)
  • Runs rollouts via a Rollout Processor (more on this on the next page)
  • Evaluates and writes scores onto each row
  • Aggregates results and surfaces a pass/fail signal for CI
You can think of it as:
  • pytest for orchestration
  • Eval Protocol for rollouts, retries, and logging
  • Your function body for scoring
For the full API, see the reference page. This guide focuses on the shortest path to something useful.

Smallest useful example (pointwise, single model)

This example:
  • Loads rows from a JSONL dataset
  • Calls a model once per row using the default rollout processor
  • Scores each row from 0–1
test_math_reasoning.py
import pytest
from eval_protocol.models import EvaluateResult, EvaluationRow
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor


def evaluate_math_reasoning(messages) -> float:
    # Dummy scoring: in a real eval, parse the model's answer and check it
    text = (messages[-1].content or "").lower()
    return 1.0 if "4" in text else 0.0


@pytest.mark.parametrize(
    "completion_params",
    [
        {"model": "openai/gpt-4o", "temperature": 0.0},
    ],
)
@evaluation_test(
    input_dataset=["path/to/dataset.jsonl"],
    rollout_processor=SingleTurnRolloutProcessor(),
    passed_threshold=0.8,
    mode="pointwise",
)
def test_math_reasoning(row: EvaluationRow) -> EvaluationRow:
    """Score a single row between 0 and 1."""
    score = evaluate_math_reasoning(row.messages)
    row.evaluation_result = EvaluateResult(
        score=score,
        reason="Dummy check: looks for the expected answer in the response",
    )
    return row
Run it with:
pytest test_math_reasoning.py
The decorator handles dataset loading, concurrency, aggregation, and summary generation for you.
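If you don't already have a dataset file, the sketch below writes a tiny JSONL file you can point input_dataset at. The exact row schema depends on your dataset adapter, so treat the "messages" shape here as an assumption and check the reference for the format your setup expects.
# Hypothetical helper to create a small dataset for experimentation.
# Assumption: each JSONL line carries a "messages" list; adjust to whatever
# schema your dataset adapter expects.
import json

rows = [
    {"messages": [{"role": "user", "content": "What is 2 + 2?"}]},
    {"messages": [{"role": "user", "content": "What is 7 - 3?"}]},
]

with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")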

Minimal mental model

  • Input: rows come from one of:
    • input_dataset (JSONL paths)
    • input_messages (inline messages)
    • input_rows (pre‑built EvaluationRows)
    • data_loaders (dynamic loaders)
  • Rollouts: controlled by:
    • completion_params (model + generation settings)
    • rollout_processor (how to talk to the model / environment)
  • Scoring:
    • Your function body writes row.evaluation_result.score in [0, 1]
    • Optionally add evaluation_result.reason and evaluation_result.metrics
If you remember “rows in → rollouts → scores out”, you’re 80% of the way there.
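For quick experiments you can skip the dataset file entirely and pass messages inline. The sketch below assumes input_messages accepts a list of message lists (one inner list per row) and that Message is importable from eval_protocol.models; confirm both against the reference.
import pytest
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor


@pytest.mark.parametrize(
    "completion_params",
    [
        {"model": "openai/gpt-4o", "temperature": 0.0},
    ],
)
@evaluation_test(
    input_messages=[
        [Message(role="user", content="What is 2 + 2?")],
        [Message(role="user", content="Name the capital of France.")],
    ],
    rollout_processor=SingleTurnRolloutProcessor(),
    mode="pointwise",
)
def test_inline_messages(row: EvaluationRow) -> EvaluationRow:
    # Trivial scoring: reward any non-empty assistant response.
    has_reply = bool(row.messages and row.messages[-1].content)
    row.evaluation_result = EvaluateResult(
        score=1.0 if has_reply else 0.0,
        reason="Non-empty response" if has_reply else "Empty response",
    )
    return row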

Using is_score_valid for Training

The is_score_valid field controls whether a rollout’s score should be used in training algorithms. When set to False, the rollout is excluded from training data (e.g., RFT, GRPO) while still being logged for analysis.
When to set is_score_valid=False:
  • No assistant response exists (malformed rollout)
  • The evaluation cannot produce a meaningful score
  • External dependencies failed (e.g., tool execution errors)
  • The response is unparseable or invalid
Example: Excluding rollouts without assistant responses
from eval_protocol.models import EvaluateResult, EvaluationRow


def evaluate(row: EvaluationRow) -> EvaluationRow:
    # Check if the last message is from the assistant
    if not row.messages or row.messages[-1].role != "assistant":
        row.evaluation_result = EvaluateResult(
            score=0.0,
            reason="No assistant response",
            is_score_valid=False  # Exclude from training
        )
        return row
    
    # Normal evaluation logic (compute_score is a placeholder for your scoring)
    score = compute_score(row)
    row.evaluation_result = EvaluateResult(
        score=score,
        reason="Evaluation completed",
        is_score_valid=True  # Include in training (default)
    )
    return row
Training pipelines (TRL, rLLM, OpenAI RFT) use is_score_valid to filter rollouts before computing gradients. A rollout with is_score_valid=False will not contribute to the loss function, preventing noisy or invalid samples from affecting model updates.
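Conceptually, that filtering step boils down to something like the hypothetical sketch below (each framework implements its own version):
from typing import List
from eval_protocol.models import EvaluationRow


def rollouts_for_training(rows: List[EvaluationRow]) -> List[EvaluationRow]:
    # Keep only rollouts whose scores are marked safe to train on.
    return [
        row
        for row in rows
        if row.evaluation_result is not None and row.evaluation_result.is_score_valid
    ]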

When to jump to the full reference

Stay on this page until you need:
  • data_loaders or custom adapters
  • Advanced aggregation configs
  • Custom exception handling and backoff strategies
  • Detailed environment variable behavior
When you hit those needs, the full @evaluation_test reference covers every parameter and edge case.