The @evaluation_test decorator is the core component for creating pytest-based evaluation tests in Eval Protocol. It lets you evaluate AI models by running rollouts and applying evaluation criteria to measure performance.
## What is an @evaluation_test?
An @evaluation_test is a pytest test with superpowers:
- Takes in rows (from a dataset, loader, or hard‑coded messages)
- Runs rollouts via a Rollout Processor (more on this on the next page)
- Evaluates and writes scores onto each row
- Aggregates results and surfaces a pass/fail signal for CI
Under the hood it combines:
- pytest for orchestration
- Eval Protocol for rollouts, retries, and logging
- Your function body for scoring
## Smallest useful example (pointwise, single model)
This example:
- Loads rows from a JSONL dataset
- Calls a model once per row using the default rollout processor
- Scores each row from 0–1
test_math_reasoning.py
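Below is a minimal sketch of what this file can look like, not the canonical example: only @evaluation_test, EvaluationRow, input_dataset, completion_params, and row.evaluation_result are named on this page. The import paths, the EvaluateResult model, the row.ground_truth and row.messages fields, the dataset path, and the model name are assumptions or placeholders, so check them against the @evaluation_test reference.

```python
# test_math_reasoning.py: minimal pointwise sketch.
# Assumed: these import paths, the EvaluateResult model, and the
# row.ground_truth / row.messages fields. The dataset path and model
# name are placeholders.
from eval_protocol.models import EvaluateResult, EvaluationRow
from eval_protocol.pytest import evaluation_test


@evaluation_test(
    input_dataset=["tests/data/math_reasoning.jsonl"],  # JSONL rows to evaluate
    completion_params=[{"model": "my-provider/my-model", "temperature": 0.0}],
)
def test_math_reasoning(row: EvaluationRow) -> EvaluationRow:
    """Score one row: 1.0 if the expected answer appears in the model's reply."""
    expected = (row.ground_truth or "").strip()  # expected answer from the dataset row
    reply = row.messages[-1].content or ""       # assistant reply produced by the rollout

    score = 1.0 if expected and expected in reply else 0.0
    row.evaluation_result = EvaluateResult(
        score=score,
        reason="expected answer found in reply" if score else "expected answer missing",
    )
    return row
```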
The decorator handles dataset loading, concurrency, aggregation, and summary generation for you.
## Minimal mental model
- Input: rows come from one of:
  - `input_dataset` (JSONL paths)
  - `input_messages` (inline messages; see the sketch below)
  - `input_rows` (pre-built `EvaluationRow`s)
  - `data_loaders` (dynamic loaders)
- Rollouts: controlled by:
  - `completion_params` (model + generation settings)
  - `rollout_processor` (how to talk to the model / environment)
- Scoring:
  - Your function body writes `row.evaluation_result.score` in [0, 1]
  - Optionally add `evaluation_result.reason` and `evaluation_result.metrics`
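For quick experiments, the inline `input_messages` option above can replace the dataset file entirely. The sketch below carries the same caveats as the previous example; in addition, the `Message` model and the list-of-conversations shape of `input_messages` are assumptions.

```python
# Inline-input sketch: same decorator, but rows are built from hard-coded
# messages instead of a JSONL file. Message and the nested-list shape of
# input_messages are assumed; check the @evaluation_test reference.
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import evaluation_test


@evaluation_test(
    input_messages=[
        [Message(role="user", content="What is 12 * 9?")],
        [Message(role="user", content="What is 15% of 80?")],
    ],
    completion_params=[{"model": "my-provider/my-model"}],
)
def test_inline_arithmetic(row: EvaluationRow) -> EvaluationRow:
    reply = row.messages[-1].content or ""
    # Without ground truth, score a simple format check: does the reply
    # contain a numeric answer at all?
    score = 1.0 if any(ch.isdigit() for ch in reply) else 0.0
    row.evaluation_result = EvaluateResult(
        score=score,
        reason="reply contains a number" if score else "no number in reply",
    )
    return row
```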
## When to jump to the full reference
Stay on this page until you need:
- `data_loaders` or custom adapters
- Advanced aggregation configs
- Custom exception handling and backoff strategies
- Detailed environment variable behavior
The @evaluation_test reference covers every parameter and edge case.
