Principles

Eval Protocol (EP) is built on these fundamental principles that guide every design decision and feature:

1. Evaluations as Code

EP treats evaluations as first-class code—not configuration files or ad-hoc scripts. Every evaluation is a pytest function that can be:

Parameterized and composed
Version controlled and tested
Integrated into CI/CD pipelines
Reused across different models and datasets

This approach ensures evaluations are maintainable, debuggable, and evolve with your codebase.

2. Developer Experience First

EP prioritizes developer productivity through:

Simple Integration: Write evals as pytest functions with familiar decorators
Rich Metadata: Automatic parameterization, result storage, and tooling
Flexible Data Models: Good defaults with extensibility for complex scenarios
IDE Support: Full IntelliSense, debugging, and testing integration
Local UI: A local UI to review, analyze, and identify trends in evals and rollouts in real-time

3. Standards-Based Interoperability

EP builds on existing, proven standards rather than creating new ones:

OpenAI Chat Completions API: Compatible with industry-standard model interfaces for storing trajectories in the standard dataset format
Model Context Protocol (MCP): Leverages established tool-calling standards
pytest: Integrates with the Python testing ecosystem you already know
LiteLLM: Unified access to 100+ LLM providers with OpenAI-compatible interface
Git & PEP 440: Automatic retrieval and storage of git commit data alongside eval results for version tracking

This ensures EP works with your existing tools and workflows.

4. Non-Prescriptive Architecture

EP does not prescribe how your AI systems work:

Flexible Rollout Processors: Use default processors for simple LLM calls or LLM + MCP calls, or bring your own custom implementations
Custom Integration: Write rollout processors in Python or call out to external APIs to produce evaluation inputs
System Agnostic: Works with any AI architecture, from simple chat completions to complex multi-agent systems
Extensible Design: Adapt EP to your specific use case rather than adapting your system to EP

This flexibility ensures EP can evaluate any AI system, regardless of its internal architecture or deployment strategy.

5. Open Source Foundation

EP believes open source is the only way to unify AI developers on a standard:

Community-Driven: Transparent development process with open discussions and contributions
Vendor Neutral: No lock-in to proprietary evaluation frameworks or closed ecosystems
Collective Intelligence: Leverages the entire AI community’s expertise and feedback
Sustainable Standards: Open source ensures long-term viability and adoption of evaluation standards

This commitment to openness ensures EP can become a truly universal standard that serves the entire AI development community.

6. Performance at Scale

EP is designed for production workloads:

Parallel Execution: Efficient parallel processing for large evaluation runs
Optimized for Multi-turn: Specialized handling for complex agent evaluations

7. Evolutionary Architecture

EP grows with your AI development journey:

Single-turn to Multi-turn: Start with simple model comparisons, scale to complex agent evaluations
Static to Dynamic: Begin with curated datasets, evolve to interactive environments
Evaluation to Training: Use the same rubrics for benchmarking and RL dataset generation

8. Reinforcement Learning Ready

EP is designed to bridge the gap between evaluation and training:

Per-step Rewards: Structured feedback for RL training
Environment Simulation: Realistic agent testing scenarios
User Simulation: Automated interaction testing
Data Flywheel: Turn evaluations into training data

The goal is to help developers build AI systems that improve through feedback loops, not just prompt engineering.

In essence: EP transforms evaluations from one-off tests into the foundation of your AI development loop—enabling you to build systems that learn and improve over time.

Tutorials

Examples

Integrations

Concepts

Reference

Open-Resource Benchmarks

1. Evaluations as Code

2. Developer Experience First

3. Standards-Based Interoperability

4. Non-Prescriptive Architecture

5. Open Source Foundation

6. Performance at Scale

7. Evolutionary Architecture

8. Reinforcement Learning Ready

Tutorials

Examples

Integrations

Concepts

Reference

Open-Resource Benchmarks

​1. Evaluations as Code

​2. Developer Experience First

​3. Standards-Based Interoperability

​4. Non-Prescriptive Architecture

​5. Open Source Foundation

​6. Performance at Scale

​7. Evolutionary Architecture

​8. Reinforcement Learning Ready

1. Evaluations as Code

2. Developer Experience First

3. Standards-Based Interoperability

4. Non-Prescriptive Architecture

5. Open Source Foundation

6. Performance at Scale

7. Evolutionary Architecture

8. Reinforcement Learning Ready