Eval Protocol (EP) is built on these fundamental principles that guide every design decision and feature:

1. Evaluations as Code

EP treats evaluations as first-class code—not configuration files or ad-hoc scripts. Every evaluation is a pytest function that can be:
  • Parameterized and composed
  • Version controlled and tested
  • Integrated into CI/CD pipelines
  • Reused across different models and datasets
This approach ensures evaluations are maintainable, debuggable, and evolve with your codebase.

2. Developer Experience First

EP prioritizes developer productivity through:
  • Simple Integration: Write evals as pytest functions with familiar decorators
  • Rich Metadata: Automatic parameterization, result storage, and tooling
  • Flexible Data Models: Good defaults with extensibility for complex scenarios
  • IDE Support: Full IntelliSense, debugging, and testing integration
  • Local UI: A local UI to review, analyze, and identify trends in evals and rollouts in real-time

3. Standards-Based Interoperability

EP builds on existing, proven standards rather than creating new ones:
  • OpenAI Chat Completions API: Compatible with industry-standard model interfaces for storing trajectories in the standard dataset format
  • Model Context Protocol (MCP): Leverages established tool-calling standards
  • pytest: Integrates with the Python testing ecosystem you already know
  • LiteLLM: Unified access to 100+ LLM providers with OpenAI-compatible interface
  • Git & PEP 440: Automatic retrieval and storage of git commit data alongside eval results for version tracking
This ensures EP works with your existing tools and workflows.

4. Non-Prescriptive Architecture

EP does not prescribe how your AI systems work:
  • Flexible Rollout Processors: Use default processors for simple LLM calls or LLM + MCP calls, or bring your own custom implementations
  • Custom Integration: Write rollout processors in Python or call out to external APIs to produce evaluation inputs
  • System Agnostic: Works with any AI architecture, from simple chat completions to complex multi-agent systems
  • Extensible Design: Adapt EP to your specific use case rather than adapting your system to EP
This flexibility ensures EP can evaluate any AI system, regardless of its internal architecture or deployment strategy.

5. Open Source Foundation

EP believes open source is the only way to unify AI developers on a standard:
  • Community-Driven: Transparent development process with open discussions and contributions
  • Vendor Neutral: No lock-in to proprietary evaluation frameworks or closed ecosystems
  • Collective Intelligence: Leverages the entire AI community’s expertise and feedback
  • Sustainable Standards: Open source ensures long-term viability and adoption of evaluation standards
This commitment to openness ensures EP can become a truly universal standard that serves the entire AI development community.

6. Performance at Scale

EP is designed for production workloads:
  • Parallel Execution: Efficient parallel processing for large evaluation runs
  • Optimized for Multi-turn: Specialized handling for complex agent evaluations

7. Evolutionary Architecture

EP grows with your AI development journey:
  • Single-turn to Multi-turn: Start with simple model comparisons, scale to complex agent evaluations
  • Static to Dynamic: Begin with curated datasets, evolve to interactive environments
  • Evaluation to Training: Use the same rubrics for benchmarking and RL dataset generation

8. Reinforcement Learning Ready

EP is designed to bridge the gap between evaluation and training:
  • Per-step Rewards: Structured feedback for RL training
  • Environment Simulation: Realistic agent testing scenarios
  • User Simulation: Automated interaction testing
  • Data Flywheel: Turn evaluations into training data
The goal is to help developers build AI systems that improve through feedback loops, not just prompt engineering.
In essence: EP transforms evaluations from one-off tests into the foundation of your AI development loop—enabling you to build systems that learn and improve over time.