The open standard and toolkit for LLM evaluations
EP is an open specification with a Python SDK, a UI for reviewing evals, popular benchmarks, and integrations with observability and agent tooling. It gives you a consistent way to write evals, store traces, and save results. Start with simple single-turn evals for model selection and prompt engineering, then scale up to multi-turn reinforcement learning (RL) for agents using the Model Context Protocol (MCP). The same patterns carry across real-world scenarios, from markdown generation tasks to customer service agents with tool-calling capabilities.

Getting Started

Ready to dive in? Install EP with a single command and start evaluating your models:
pip install eval-protocol

Quick Example

Here’s a simple test function that checks if a model’s response contains bold text formatting:
Before running the following example, you need to set up an environment variable so the SDK can make a LiteLLM call. This example uses Fireworks (model prefix fireworks_ai/), so set the FIREWORKS_API_KEY environment variable by creating a .env file in the root of your project.
.env
FIREWORKS_API_KEY=your_api_key
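
If you prefer not to use a .env file, you can also export the key in your shell before running the test (bash syntax shown; adapt for your platform):
export FIREWORKS_API_KEY=your_api_key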
test_bold_format.py
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test


@evaluation_test(
    # Each inner list is one conversation that will be sent to the model.
    input_messages=[
        [
            Message(
                role="system", content="You are a helpful assistant. Use bold text to highlight important information."
            ),
            Message(
                role="user", content="Explain why **evaluations** matter for building AI agents. Make it dramatic!"
            ),
        ],
    ],
    # Model (and any completion parameters) used to generate the rollout.
    completion_params=[{"model": "fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct"}],
    # Generates a single assistant response for each conversation.
    rollout_processor=SingleTurnRolloutProcessor(),
    # Evaluate each row independently.
    mode="pointwise",
)
def test_bold_format(row: EvaluationRow) -> EvaluationRow:
    """
    Simple evaluation that checks if the model's response contains bold text.
    """

    # The assistant's reply is the last message in the completed rollout.
    assistant_response = row.messages[-1].content

    if assistant_response is None:
        result = EvaluateResult(score=0.0, reason="❌ No response found")
        row.evaluation_result = result
        return row

    # Some models return content as a list of parts; use the first part's content.
    if isinstance(assistant_response, list):
        assistant_response = assistant_response[0].content

    # Check if response contains **bold** text
    has_bold = "**" in assistant_response

    if has_bold:
        result = EvaluateResult(score=1.0, reason="✅ Response contains bold text")
    else:
        result = EvaluateResult(score=0.0, reason="❌ No bold text found")

    row.evaluation_result = result
    return row
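
Because the evaluation_test decorator integrates with pytest, you can run the eval like any ordinary test. A typical invocation (flags optional) looks like:
pytest test_bold_format.py -v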

Learn More

For a complete step-by-step tutorial covering a slightly more complex example, with detailed explanations, dataset examples, and configuration options, see our Single-turn eval tutorial. For a more advanced example that includes MCP and user simulation, check out our implementation of 𝜏²-bench, a benchmark for evaluating conversational agents in a dual-control environment.

Next Steps