To make it easy to build a model leaderboard for Pydantic AI agents,
eval-protocol provides an out-of-the-box rollout processor for Pydantic AI
agents.
The processor orchestrates the rollouts, so you only need to pass an agent
factory function; eval-protocol handles running your experiments against your
dataset.
To supply an agent for evaluation, you need to define an agent factory function.
An agent factory is a function of type Callable[[RolloutProcessorConfig], Agent]. See reference for more details.
In this example, we assume you have a setup_agent function that creates a
Pydantic AI agent using a given model.
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from eval_protocol.pytest.types import RolloutProcessorConfig


def agent_factory(config: RolloutProcessorConfig) -> Agent:
    model_name = config.completion_params["model"]
    # Provider is optional - defaults to "openai" if not specified
    provider = config.completion_params.get("provider", "openai")
    model = OpenAIChatModel(model_name, provider=provider)
    return setup_agent(model)
Use the completion_params to get the model name. The provider field is optional
and defaults to “openai” if not specified. The model field is the canonical way to
pass the model name to most LLM clients.
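As a rough illustration of the lookup described above, the following sketch reads model and the optional provider from a plain dictionary standing in for config.completion_params. The helper name resolve_model_settings and the placeholder model name "my-model" are hypothetical and not part of eval-protocol.

```python
from typing import Any


def resolve_model_settings(completion_params: dict[str, Any]) -> tuple[str, str]:
    """Return (model_name, provider) from a completion_params-style dict.

    Hypothetical helper for illustration only; not part of eval-protocol.
    """
    # "model" is required, so a missing key fails loudly with KeyError.
    model_name = completion_params["model"]
    # "provider" is optional and falls back to "openai".
    provider = completion_params.get("provider", "openai")
    return model_name, provider


explicit = resolve_model_settings(
    {"model": "accounts/fireworks/models/kimi-k2-instruct", "provider": "fireworks"}
)
# "my-model" is a placeholder name used only to show the default provider.
defaulted = resolve_model_settings({"model": "my-model"})
```

Indexing with ["model"] rather than .get() means a misconfigured entry is caught before any rollout starts.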
For illustration, let’s build an AI agent to help answer questions about the
Chinook database, an open-source
sample database that represents a digital media store, including tables for
artists, albums, tracks, invoices, and customers.
Our Pydantic AI agent
(source)
has access to the database through the provided execute_sql tool and the
entire database schema is injected into the system prompt. The agent should be
able to use the tool to query data and summarize the results to help answer
questions about the dataset.
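To make the two ingredients above concrete, here is a minimal sketch of a schema dump for the system prompt and an execute_sql-style function. It assumes a SQLite connection and uses a tiny in-memory table as a stand-in for Chinook; the linked agent source may use a different database driver and a different return format, and get_schema is a hypothetical helper name.

```python
import sqlite3


def get_schema(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements, suitable for pasting into a system prompt."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n\n".join(sql for (sql,) in rows)


def execute_sql(conn: sqlite3.Connection, query: str) -> str:
    """Run a query and return the rows as text, or the error message on failure."""
    try:
        cursor = conn.execute(query)
    except sqlite3.Error as exc:
        # Returning the error lets the agent see it and retry with a fixed query.
        return f"Error: {exc}"
    # cursor.description is None for statements that produce no result set.
    columns = [col[0] for col in cursor.description or []]
    lines = [" | ".join(columns)]
    lines += [" | ".join(str(value) for value in row) for row in cursor.fetchall()]
    return "\n".join(lines)


# Tiny in-memory stand-in for the Chinook database (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany("INSERT INTO Artist (Name) VALUES (?)", [("AC/DC",), ("Accept",)])

system_prompt = "You answer questions about this database:\n\n" + get_schema(conn)
result = execute_sql(conn, "SELECT Name FROM Artist ORDER BY ArtistId")
```

In a Pydantic AI agent, a function like execute_sql would be registered as a tool and system_prompt passed when constructing the agent.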
Before creating your eval, you need to parameterize your agent so that you
can evaluate it with different models. In this example, we wrap our agent
creation logic in a function called setup_agent that accepts a Pydantic AI
Model object. You can reuse this pattern in your own setup, but it is not
required.
To evaluate our agent, we curated a set of complex tasks and their ground truth
answers that we can use to evaluate the quality of the agent (dataset
here).
For example, here is one of the tasks for our eval:
Find the top 5 customers by total spending, including their favorite genre. Show customer name, favorite genre, total invoices, total spent, and spending rank.
And here is the ground truth answer:
| customer_name | favorite_genre | total_invoices | total_spent | spending_rank |
| --- | --- | --- | --- | --- |
| Helena Holý | Rock | 7 | 49.62 | 1 |
| Richard Cunningham | Rock | 7 | 47.62 | 2 |
| Luis Rojas | Rock | 7 | 46.62 | 3 |
| Ladislav Kovács | Rock | 7 | 45.62 | 4 |
| Hugh O’Reilly | Rock | 7 | 45.62 | 4 |
For each task, the agent should be evaluated on its ability to collect data
through SQL calls and pass through or give a high-quality summary of the correct
data for the task.
Evals in eval-protocol return a score between 0.0 and 1.0. For this example,
we will give either a score of 0 or 1 depending on whether the final answer from
the agent contains the same or well summarized information as the data shown in
the ground truth.
Every eval in eval-protocol expects an input dataset of type List[EvaluationRow].
In our example, we define a collect_dataset function that helps us read tasks
and ground truth answers from the dataset folder.
For this example, we use an LLM-based judge to compare the agent’s response against the ground truth answer.
Here’s the complete scoring implementation from the test:
import os

from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
import pytest

from eval_protocol.models import EvaluateResult, EvaluationRow
from eval_protocol.pytest import evaluation_test
from eval_protocol.pytest.types import RolloutProcessorConfig
from tests.chinook.dataset import collect_dataset
from tests.chinook.pydantic.agent import setup_agent
from tests.pytest.test_pydantic_agent import PydanticAgentRolloutProcessor

LLM_JUDGE_PROMPT = (
    "Your job is to compare the response to the expected answer.\n"
    "The response will be a narrative report of the query results.\n"
    "If the response contains the same or well summarized information as the expected answer, return 1.0.\n"
    "If the response does not contain the same information or is missing information, return 0.0."
)


def agent_factory(config: RolloutProcessorConfig) -> Agent:
    model_name = config.completion_params["model"]
    provider = config.completion_params.get("provider", "openai")
    model = OpenAIChatModel(model_name, provider=provider)
    return setup_agent(model)


@pytest.mark.skipif(
    os.environ.get("CI") == "true",
    reason="Only run this test locally (skipped in CI)",
)
@pytest.mark.asyncio
@evaluation_test(
    input_rows=[collect_dataset()],
    completion_params=[
        {
            "model": "accounts/fireworks/models/kimi-k2-instruct",
            "provider": "fireworks",  # Optional: defaults to "openai"
        },
    ],
    rollout_processor=PydanticAgentRolloutProcessor(agent_factory),
)
async def test_pydantic_complex_queries(row: EvaluationRow) -> EvaluationRow:
    """
    Evaluation of complex queries for the Chinook database using PydanticAI
    """
    last_assistant_message = row.last_assistant_message()
    if last_assistant_message is None:
        row.evaluation_result = EvaluateResult(
            score=0.0,
            reason="No assistant message found",
        )
    elif not last_assistant_message.content:
        row.evaluation_result = EvaluateResult(
            score=0.0,
            reason="No assistant message found",
        )
    else:
        # The judge uses the same model class that agent_factory imports.
        model = OpenAIChatModel(
            "accounts/fireworks/models/kimi-k2-instruct",
            provider="fireworks",
        )

        class Response(BaseModel):
            """
            A score between 0.0 and 1.0 indicating whether the response is correct.
            """

            score: float

            """
            A short explanation of why the response is correct or incorrect.
            """

            reason: str

        comparison_agent = Agent(
            model=model,
            system_prompt=LLM_JUDGE_PROMPT,
            output_type=Response,
            output_retries=5,
        )
        result = await comparison_agent.run(
            f"Expected answer: {row.ground_truth}\nResponse: {last_assistant_message.content}"
        )
        row.evaluation_result = EvaluateResult(
            score=result.output.score,
            reason=result.output.reason,
        )
    return row
To compare different models, modify the completion_params in your @evaluation_test decorator to include multiple models. Set num_runs=3 to generate multiple samples per row, providing more robust evaluation results by running each test case 3 times:
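The following is only a sketch of what such a configuration might look like, written as plain Python values so it runs on its own. The second model path is an assumption that follows the pattern of the first, the row count of 10 is a placeholder, and planned_rollouts is a hypothetical helper that assumes every row runs once per model per run.

```python
# Assumption: the second model ID follows the same path pattern as the first.
completion_params = [
    {"model": "accounts/fireworks/models/kimi-k2-instruct", "provider": "fireworks"},
    {"model": "accounts/fireworks/models/kimi-k2-instruct-0905", "provider": "fireworks"},
]
num_runs = 3


def planned_rollouts(num_rows: int, completion_params: list[dict], num_runs: int) -> int:
    """Count rollouts, assuming every row runs once per model per run."""
    return num_rows * len(completion_params) * num_runs


# In the test itself these values would be passed to the decorator, e.g.
# @evaluation_test(..., completion_params=completion_params, num_runs=num_runs)
# The row count of 10 is a placeholder, not the size of the Chinook dataset.
total = planned_rollouts(num_rows=10, completion_params=completion_params, num_runs=num_runs)
```

Estimating the rollout count up front is a cheap way to gauge cost before adding more models or runs.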
After running the evaluation, you can analyze the results using the Pivot View. The pivot view allows you to:
Compare model performance across different metrics
Create visualizations and charts
Export results as images or CSV files
Filter and aggregate data by various dimensions
The leaderboard shows that kimi-k2-instruct-0905 and kimi-k2-instruct models perform best on the complex queries evaluation, significantly outperforming other models in the comparison.
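The kind of aggregation behind a leaderboard like this can be sketched in a few lines: group scores by model and rank by mean score. The records below are synthetic placeholders with made-up model names, not results from this evaluation, and the leaderboard function is a hypothetical helper rather than part of eval-protocol.

```python
from collections import defaultdict
from statistics import mean

# Synthetic placeholder records: (model, row_id, score). Not real results.
records = [
    ("model-a", "task-1", 1.0),
    ("model-a", "task-2", 0.0),
    ("model-a", "task-1", 1.0),
    ("model-b", "task-1", 0.0),
    ("model-b", "task-2", 0.0),
    ("model-b", "task-1", 1.0),
]


def leaderboard(records):
    """Return (model, mean score, sample count) tuples, best mean score first."""
    scores_by_model = defaultdict(list)
    for model, _row_id, score in records:
        scores_by_model[model].append(score)
    rows = [(model, mean(scores), len(scores)) for model, scores in scores_by_model.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


board = leaderboard(records)
```

Keeping the sample count next to each mean makes it easier to judge how much weight a gap between two models deserves.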