The @evaluation_test decorator is the core component for creating pytest-based evaluation tests in the Evaluation Protocol. It lets you evaluate AI models by running rollouts and applying evaluation criteria to measure performance.
What is an @evaluation_test?
An @evaluation_test is a pytest test with superpowers:
- Takes in rows (from a dataset, loader, or hard‑coded messages)
- Runs rollouts via a Rollout Processor (more on this on the next page)
- Evaluates and writes scores onto each row
- Aggregates results and surfaces a pass/fail signal for CI
It combines:
- pytest for orchestration
- Eval Protocol for rollouts, retries, and logging
- Your function body for scoring
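As a purely conceptual sketch of that division of labor, the loop below rolls out each row, scores it, and aggregates the scores into a single pass/fail signal. It is not the library's implementation: `Row`, `run_eval`, `echo_rollout`, `exact_match`, and the `threshold` argument are hypothetical names invented for illustration, and the real decorator also handles concurrency, retries, and logging.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Optional


@dataclass
class Row:
    """Hypothetical stand-in for a dataset row; not the library's EvaluationRow."""
    prompt: str
    expected: str
    response: Optional[str] = None
    score: Optional[float] = None


def run_eval(
    rows: List[Row],
    rollout: Callable[[str], str],
    score: Callable[[Row], float],
    threshold: float,
) -> bool:
    """Roll out each row, score it, then aggregate into one pass/fail signal."""
    for row in rows:
        row.response = rollout(row.prompt)  # rollout step
        row.score = score(row)              # scoring step (your function body)
    # Aggregation step: mean score compared against a threshold.
    return mean(r.score for r in rows) >= threshold


def echo_rollout(prompt: str) -> str:
    """Fake model call used only for this illustration."""
    return "4" if prompt == "2 + 2 = ?" else "unknown"


def exact_match(row: Row) -> float:
    """Score 1.0 when the response equals the expected answer, else 0.0."""
    return 1.0 if row.response == row.expected else 0.0


rows = [Row("2 + 2 = ?", "4"), Row("3 + 5 = ?", "8")]
passed = run_eval(rows, echo_rollout, exact_match, threshold=0.5)
```

Only the scoring callable corresponds to code you write yourself; the rest is what the decorator is described as taking off your hands.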
Smallest useful example (pointwise, single model)
This example:
- Loads rows from a JSONL dataset
- Calls a model once per row using the default rollout processor
- Scores each row from 0–1
test_math_reasoning.py
The decorator handles dataset loading, concurrency, aggregation, and summary generation for you.
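The scoring step of a math-reasoning test can be sketched in isolation. The helper below is an assumption about what such a function body might do, not code from the original example: `extract_final_number` and `score_math_answer` are hypothetical names, and the rule "compare the last number in the response to the last number in the ground truth" is one simple choice among many.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number that appears in the text, or None if there is none."""
    # Drop thousands separators so "1,250" is read as 1250.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def score_math_answer(response: str, ground_truth: str) -> float:
    """Return 1.0 when the final numbers agree, otherwise 0.0."""
    predicted = extract_final_number(response)
    expected = extract_final_number(ground_truth)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) < 1e-6 else 0.0
```

Inside a decorated test, a value like this would be written to the row's evaluation result rather than returned on its own.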
Minimal mental model
- Input: rows come from one of:
  - input_dataset (JSONL paths)
  - input_messages (inline messages)
  - input_rows (pre‑built EvaluationRows)
  - data_loaders (dynamic loaders)
- Rollouts: controlled by:
  - completion_params (model + generation settings)
  - rollout_processor (how to talk to the model / environment)
- Scoring:
  - Your function body writes row.evaluation_result.score in [0, 1]
  - Optionally add evaluation_result.reason and evaluation_result.metrics
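To make the scoring contract concrete, here is a minimal sketch of a scorer that fills in a score in [0, 1] together with a reason and per-item metrics. The `Result` dataclass is a simplified, hypothetical stand-in that only borrows the field names mentioned above; the library's own result type may structure metrics differently. `score_response` and `required_terms` are likewise invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Result:
    """Hypothetical stand-in mirroring the field names above; not the library's class."""
    score: float
    reason: Optional[str] = None
    metrics: Dict[str, float] = field(default_factory=dict)


def score_response(response: str, required_terms: List[str]) -> Result:
    """Score by the fraction of required terms present in the response."""
    lowered = response.lower()
    found = {term: float(term.lower() in lowered) for term in required_terms}
    hits = sum(found.values())
    score = hits / len(found) if found else 0.0
    score = min(1.0, max(0.0, score))  # keep the score inside [0, 1]
    reason = f"{int(hits)}/{len(found)} required terms present"
    return Result(score=score, reason=reason, metrics=found)


result = score_response("Paris is the capital of France", ["Paris", "France", "Europe"])
```

A short reason string and a breakdown by metric make low scores easier to diagnose when reviewing logged rows.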
Using is_score_valid for Training
The is_score_valid field controls whether a rollout’s score should be used in training algorithms. When set to False, the rollout is excluded from training data (e.g., RFT, GRPO) while still being logged for analysis.
When to set is_score_valid=False:
- No assistant response exists (malformed rollout)
- The evaluation cannot produce a meaningful score
- External dependencies failed (e.g., tool execution errors)
- The response is unparseable or invalid
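A minimal sketch of the first case, assuming a simplified result object: the scorer marks a rollout with no assistant response as invalid, and a downstream filter drops it from the training set while the full list stays available for analysis. `ScoredResult`, `score_rollout`, and `training_samples` are hypothetical names; only the `is_score_valid` field name comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredResult:
    """Hypothetical stand-in; only the is_score_valid field name comes from the text."""
    score: float
    is_score_valid: bool = True
    reason: Optional[str] = None


def score_rollout(assistant_response: Optional[str], expected: str) -> ScoredResult:
    """Score a rollout, flagging it as invalid when there is nothing to score."""
    if not assistant_response:
        # Malformed rollout: keep it for logging, exclude it from training.
        return ScoredResult(score=0.0, is_score_valid=False, reason="no assistant response")
    return ScoredResult(score=1.0 if expected in assistant_response else 0.0)


def training_samples(results: List[ScoredResult]) -> List[ScoredResult]:
    """Keep only the results whose scores may be used for training."""
    return [r for r in results if r.is_score_valid]


results = [
    score_rollout("The answer is 4", "4"),
    score_rollout(None, "4"),
    score_rollout("It is 5", "4"),
]
kept = training_samples(results)
```

Note that a wrong answer (score 0.0) is still a valid training signal; only the rollout that could not be scored meaningfully is filtered out.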
Training pipelines (TRL, rLLM, OpenAI RFT) use is_score_valid to filter rollouts before computing gradients. A rollout with is_score_valid=False will not contribute to the loss function, preventing noisy or invalid samples from affecting model updates.
When to jump to the full reference
Stay on this page until you need:
- data_loaders or custom adapters
- Advanced aggregation configs
- Custom exception handling and backoff strategies
- Detailed environment variable behavior
The @evaluation_test reference covers every parameter and edge case.
