
Example quality scores across 4 different models using the AIME benchmark implemented in eval-protocol.
- `@evaluation_test` decorator: a pytest-compatible decorator for configuring and authoring evaluations (see the sketch after this list)
- Rollout processing engine that reliably handles flaky LLM APIs and parallelism for long-running evaluations
- Built-in integrations with LLM observability platforms such as Braintrust, Langfuse, and LangSmith, as well as the Responses API
- Native support for agent frameworks like LangGraph and Pydantic AI
- MCP-based framework for building reinforcement learning (RL) environments
- Local UI and storage for reviewing, analyzing, and inspecting evaluations and rollouts in real time
- Opinionated framework that separates configuration, data collection, rollout processing, and evaluation into distinct concerns
- Built-in benchmarks, including AIME and tau-bench
- Built-in evaluator that stack-ranks models using LLMs as judges, relying only on model traces
- Built-in statistical methods for aggregating eval results
- Unified data model for processing, running, and storing evaluations
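
To make the `@evaluation_test` decorator concrete, here is a minimal sketch of what authoring a pytest-compatible evaluation might look like. The import path, the decorator parameters (`input_dataset`, `completion_params`), and the row helpers and fields are illustrative assumptions rather than the library's exact API; consult the eval-protocol documentation for the real signatures.

```python
# Minimal sketch of authoring an evaluation with @evaluation_test.
# The import path, decorator parameters, and row helpers shown here are
# illustrative assumptions, not the library's exact API.
from eval_protocol.pytest import evaluation_test  # assumed import path


@evaluation_test(
    input_dataset=["data/aime_sample.jsonl"],      # assumed parameter: rollout inputs
    completion_params=[{"model": "my-model-id"}],  # assumed parameter: model under test
)
def test_aime_exact_match(row):
    """Score one rollout: full credit when the final answer matches the label."""
    answer = row.last_assistant_content().strip()            # assumed helper on the row object
    row.score = 1.0 if answer == row.ground_truth else 0.0   # assumed score/label fields
    return row
```

Because the decorator is pytest-compatible, a test like this can be collected and run the same way as any other pytest test.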