Evaluations as Code
At the heart of EP is a simple but powerful idea: rubrics are just pytest functions. Each rubric specifies how to run a model on a task and how to score the result. These functions can be:
- Parameterized over models, datasets, or configurations
- Composed together to run controlled experiments
- Used to generate datasets for supervised fine-tuning or reinforcement learning
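A minimal sketch of the idea in plain pytest. The `run_model` stub and the model names are hypothetical placeholders, not EP's actual API; a real rubric would call out to a model endpoint:

```python
import pytest

# Hypothetical stand-in for a real model call.
def run_model(model: str, prompt: str) -> str:
    canned = {"model-a": "The answer is 4.", "model-b": "four"}
    return canned.get(model, "")

# A rubric is just a pytest function, parameterized over models.
@pytest.mark.parametrize("model", ["model-a", "model-b"])
def test_arithmetic_rubric(model):
    answer = run_model(model, "What is 2 + 2?")
    # The rubric defines success: here, a simple containment check.
    assert "4" in answer or "four" in answer.lower()
```

Because the rubric is ordinary test code, swapping the model, dataset, or scoring logic is a one-line change, and the same function can be collected by CI or looped over to build a dataset.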
One Format, Two Modes
EP supports both static and dynamic evaluation:
Static Evals
Evaluate models on curated test sets (e.g., prompt-response pairs with expected outputs). Use this to compare models, debug regressions, or gate releases. These evals can be versioned and automatically run in CI.
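As an illustration, a static eval can be as simple as a versioned list of prompt/expected pairs plus a scoring loop. The dataset and `fake_model` stub below are hypothetical, standing in for a real curated set and model call:

```python
# Hypothetical curated test set, versioned alongside the code.
DATASET_V1 = [
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

def fake_model(prompt: str) -> str:
    # Stand-in for a real model call.
    return {"Capital of France?": "Paris", "2 + 2 = ?": "4"}[prompt]

def run_static_eval(model, dataset) -> float:
    """Score a model on a static dataset; returns pass rate in [0, 1]."""
    passed = sum(model(case["prompt"]) == case["expected"] for case in dataset)
    return passed / len(dataset)

print(run_static_eval(fake_model, DATASET_V1))  # 1.0
```

Pinning the dataset version makes the pass rate comparable across runs, which is what lets it gate a release.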
Dynamic Evals
Evaluate agents as they interact with tools or environments over multiple steps. Each step logs the observation, action (e.g., a tool call), and feedback from the rubric. These reward traces can be used for reinforcement learning (e.g., PPO or GRPO), or analyzed post-hoc to improve agent design.
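One way to sketch such a reward trace, assuming a simple per-step log of observation, action, and rubric feedback (the episode contents here are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str
    action: str    # e.g., a tool call or a final answer
    reward: float  # feedback from the rubric at this step

@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def log(self, observation: str, action: str, reward: float) -> None:
        self.steps.append(Step(observation, action, reward))

    @property
    def total_reward(self) -> float:
        return sum(s.reward for s in self.steps)

# Hypothetical two-step episode: the agent calls a tool, then answers.
trace = Trace()
trace.log("user: what is 13 * 7?", "tool_call: calculator(13 * 7)", 0.5)
trace.log("tool: 91", "answer: 91", 1.0)
print(trace.total_reward)  # 1.5
```

A trace in this shape can feed an RL trainer as (state, action, reward) tuples, or be inspected step by step to see where an agent went wrong.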
Rubrics Are the New Prompts
Just as prompt engineering shapes model behavior before inference, rubric engineering shapes behavior after the fact—by defining how to measure success. EP makes it easy to iterate on rubrics just like prompts. Rubrics can return:
- Scalar rewards (for RL)
- Structured error information
- Partial credit scoring (for tasks like math, tool use, etc.)
- Assertions and failure modes
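A rubric that combines several of these return types might look like the following sketch. The step names and the `math_rubric` helper are hypothetical, chosen to illustrate partial credit on a multi-step math task:

```python
def math_rubric(expected_steps: list, response_steps: list) -> dict:
    """Partial-credit rubric: fraction of expected steps present,
    plus structured information about what is missing."""
    missing = [s for s in expected_steps if s not in response_steps]
    return {
        "reward": 1 - len(missing) / len(expected_steps),  # scalar reward for RL
        "errors": missing,                                  # structured error info
    }

result = math_rubric(
    expected_steps=["factor", "cancel", "simplify"],
    response_steps=["factor", "simplify"],
)
# result["reward"] is 2/3; result["errors"] is ["cancel"]
```

Returning a dict rather than a bare pass/fail lets the same rubric drive RL training (via the scalar) and debugging (via the error list).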
Reference: RL for Agents
In his talk on reinforcement learning for agents, Will Brown emphasizes that future agents won’t emerge from prompt chaining alone—they require feedback loops. He argues that capabilities like tool use and self-correction only arise when models are trained in environments with structured rewards and rich instrumentation. EP was designed with that exact workflow in mind: rubrics that double as rewards, environments that log every step, and infrastructure that makes it trivial to go from test case to fine-tuning dataset to full RL training run.
Build the Feedback Loops You Need
With EP, you get one evaluation framework that supports:
- ✅ Offline benchmarking (for model selection and quality tracking)
- ✅ Dataset generation (for SFT and RLHF)
- ✅ Structured instrumentation (for tool-using agents and planners)
- ✅ Reward shaping and debugging (for custom training workflows)
- ✅ CI integration (so regressions are caught early)