> ## Documentation Index
> Fetch the complete documentation index at: https://evalprotocol.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluation Tests (Getting Started)

> Write your first @evaluation_test in a few minutes, then scale up to real benchmarks.

The [`@evaluation_test`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/pytest/evaluation_test.py) decorator is the core component for creating pytest-based evaluation tests in the Evaluation Protocol. It enables you to evaluate AI models by running rollouts and applying evaluation criteria to measure performance.

## What is an `@evaluation_test`?

An `@evaluation_test` is a **pytest test with superpowers**:

* **Takes in rows** (from a dataset, loader, or hard‑coded messages)
* **Runs rollouts via a [Rollout Processor](/tutorial/rollout-processors-getting-started)** (more on this on the next page)
* **Evaluates and writes scores** onto each row
* **Aggregates results** and surfaces a pass/fail signal for CI

You can think of it as:

* pytest for orchestration
* Eval Protocol for rollouts, retries, and logging
* Your function body for scoring

For the full API, see the [reference page](/reference/evaluation-test). This guide focuses on the **shortest path to something useful**.

## Smallest useful example (pointwise, single model)

This example:

* Loads rows from a JSONL dataset
* Calls a model once per row using the default rollout processor
* Scores each row from 0–1

```python test_math_reasoning.py theme={null}
import pytest
from typing import List
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest import evaluation_test, SingleTurnRolloutProcessor


def evaluate_math_reasoning(messages) -> float:
    # Dummy scoring: in a real eval, parse the model's answer and check it
    text = (messages[-1].content or "").lower()
    return 1.0 if "4" in text else 0.0


@pytest.mark.parametrize(
    "completion_params",
    [
        {"model": "openai/gpt-4o", "temperature": 0.0},
    ],
)
@evaluation_test(
    input_dataset=["path/to/dataset.jsonl"],
    rollout_processor=SingleTurnRolloutProcessor(),
    passed_threshold=0.8,
    mode="pointwise",
)
def test_math_reasoning(row: EvaluationRow) -> EvaluationRow:
    """Score a single row between 0 and 1."""
    score = evaluate_math_reasoning(row.messages)
    row.evaluation_result.score = score
    return row
```

Run it with:

```bash theme={null}
pytest test_math_reasoning.py
```

> The decorator handles dataset loading, concurrency, aggregation, and summary generation for you.

## Minimal mental model

* **Input**: rows come from **one** of:
  * `input_dataset` (JSONL paths)
  * `input_messages` (inline messages)
  * `input_rows` (pre‑built `EvaluationRow`s)
  * `data_loaders` (dynamic loaders)
* **Rollouts**: controlled by:
  * `completion_params` (model + generation settings)
  * `rollout_processor` (how to talk to the model / environment)
* **Scoring**:
  * Your function body writes `row.evaluation_result.score` in \[0, 1]
  * Optionally add `evaluation_result.reason` and `evaluation_result.metrics`

If you remember **"rows in → rollouts → scores out"**, you're 80% of the way there.

## Using `is_score_valid` for Training

The `is_score_valid` field controls whether a rollout's score should be used in training algorithms. When set to `False`, the rollout is excluded from training data (e.g., RFT, GRPO) while still being logged for analysis.

**When to set `is_score_valid=False`:**

* No assistant response exists (malformed rollout)
* The evaluation cannot produce a meaningful score
* External dependencies failed (e.g., tool execution errors)
* The response is unparseable or invalid

**Example: Excluding rollouts without assistant responses**

```python theme={null}
def evaluate(row: EvaluationRow) -> EvaluationRow:
    # Check if the last message is from the assistant
    if not row.messages or row.messages[-1].role != "assistant":
        row.evaluation_result = EvaluateResult(
            score=0.0,
            reason="No assistant response",
            is_score_valid=False  # Exclude from training
        )
        return row
    
    # Normal evaluation logic
    score = compute_score(row)
    row.evaluation_result = EvaluateResult(
        score=score,
        reason="Evaluation completed",
        is_score_valid=True  # Include in training (default)
    )
    return row
```

<Note>
  Training pipelines (TRL, rLLM, OpenAI RFT) use `is_score_valid` to filter rollouts before computing gradients. A rollout with `is_score_valid=False` will not contribute to the loss function, preventing noisy or invalid samples from affecting model updates.
</Note>

## When to jump to the full reference

Stay on this page until you need:

* `data_loaders` or custom adapters
* Advanced **aggregation** configs
* Custom **exception handling** and backoff strategies
* Detailed environment variable behavior

When you hit those needs, the [full `@evaluation_test` reference](/reference/evaluation-test) covers every parameter and edge case.
