With so many AI models available, choosing the right one is becoming a critical engineering decision. This quickstart shows you how to evaluate your production LLM traces using pairwise comparisons in under 5 minutes, with no ground truth required.

Installation and Setup

Ready to dive in? Install EP with your preferred observability platform:
pip install 'eval-protocol[langfuse]'

# Model API keys (choose what you need)
export OPENAI_API_KEY="your_openai_key"
export FIREWORKS_API_KEY="your_fireworks_key"
export GEMINI_API_KEY="your_gemini_key"

# Platform keys
export LANGFUSE_PUBLIC_KEY="your_public_key"
export LANGFUSE_SECRET_KEY="your_secret_key"
export LANGFUSE_HOST="https://your-deployment.com"  # Optional

Run Your First Evaluation

The core LLM judge function is available as a simple import from llm_judge.py. Here’s a minimal example for Langfuse:
from datetime import datetime
import pytest

from eval_protocol import (
    evaluation_test,
    aha_judge,
    EvaluationRow,
    SingleTurnRolloutProcessor,
    create_langfuse_adapter,
    DynamicDataLoader,
)


def langfuse_data_generator() -> list[EvaluationRow]:
    # Pull recent traces from Langfuse and convert them into evaluation rows.
    adapter = create_langfuse_adapter()
    return adapter.get_evaluation_rows(
        to_timestamp=datetime.utcnow(),
        limit=20,
        sample_size=5,
    )


# Each completion_params entry is a candidate model to compare.
@pytest.mark.parametrize(
    "completion_params",
    [
        {"model": "gpt-4.1"},
        {"model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b"},
    ],
)
@evaluation_test(
    data_loaders=DynamicDataLoader(
        generators=[langfuse_data_generator],
    ),
    rollout_processor=SingleTurnRolloutProcessor(),
)
async def test_llm_judge(row: EvaluationRow) -> EvaluationRow:
    # aha_judge runs an Arena-Hard-Auto style pairwise comparison for this row.
    return await aha_judge(row)
We provide example implementations you can run immediately; all you need to change are the parameters in adapter.get_evaluation_rows. See the Integrations section for more information on your choice of tracing platform.
Full example at llm_judge_langfuse.py. Run it with:
pytest llm_judge_langfuse.py -v -s

Viewing the Results

At the bottom of the pytest output, you’ll see links. Click one to open the local leaderboard:
================================================================================
📊 LOCAL UI EVALUATION RESULTS
================================================================================
📊 Invocation messy-party-41:
  📊 Aggregate scores: http://localhost:8000/pivot?filterConfig=%5B%7B%22logic%22%3A%20%22AND%22%2C%20%22filters%22%3A%20%5B%7B%22field%22%3A%20%22%24.execution_metadata.invocation_id%22%2C%20%22operator%22%3A%20%22%3D%3D%22%2C%20%22value%22%3A%20%22messy-party-41%22%2C%20%22type%22%3A%20%22text%22%7D%5D%7D%5D
  📋 Trajectories: http://localhost:8000/table?filterConfig=%5B%7B%22logic%22%3A%20%22AND%22%2C%20%22filters%22%3A%20%5B%7B%22field%22%3A%20%22%24.execution_metadata.invocation_id%22%2C%20%22operator%22%3A%20%22%3D%3D%22%2C%20%22value%22%3A%20%22messy-party-41%22%2C%20%22type%22%3A%20%22text%22%7D%5D%7D%5D
================================================================================
[Image: Quickstart leaderboard, a local leaderboard showing model comparison results]

Check out what else you can do in the pivot view, including analyzing cost and time metrics.

How It Works: Arena-Hard-Auto Methodology

Arena-Hard-Auto is a pairwise comparison approach where:
  • Two models respond to the same prompt
  • An LLM judge compares the responses
  • Win rates are calculated across many comparisons
  • No ground truth is needed, just relative quality assessment (a sketch of the win-rate calculation follows below)
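To make the win-rate idea concrete, here is a minimal, standalone sketch (not part of eval-protocol, and using made-up verdicts) of how pairwise judge verdicts roll up into per-model win rates:
from collections import defaultdict

# Hypothetical pairwise verdicts: (model_a, model_b, winner) for each prompt.
verdicts = [
    ("gpt-4.1", "gpt-oss-120b", "gpt-4.1"),
    ("gpt-4.1", "gpt-oss-120b", "gpt-oss-120b"),
    ("gpt-4.1", "gpt-oss-120b", "gpt-4.1"),
]

wins = defaultdict(int)
comparisons = defaultdict(int)
for model_a, model_b, winner in verdicts:
    comparisons[model_a] += 1
    comparisons[model_b] += 1
    wins[winner] += 1

for model, total in comparisons.items():
    print(f"{model}: {wins[model] / total:.0%} win rate")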

Customization Options

Filter Your Traces

For advanced trace filtering options (by tags, users, time ranges, metadata, etc.), check out the Integrations section.
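As a minimal sketch using only the get_evaluation_rows parameters shown earlier (richer filters by tags, users, or metadata are covered in Integrations), you could point the generator at older traces by adjusting to_timestamp:
from datetime import datetime, timedelta

from eval_protocol import EvaluationRow, create_langfuse_adapter


def last_weeks_traces() -> list[EvaluationRow]:
    # Only consider traces created up to 7 days ago, and sample a few of them.
    cutoff = datetime.utcnow() - timedelta(days=7)
    adapter = create_langfuse_adapter()
    return adapter.get_evaluation_rows(
        to_timestamp=cutoff,
        limit=50,
        sample_size=10,
    )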

Try Different Models

# Modify the completion_params in your @evaluation_test:
completion_params=[
    {"model": "gpt-5"},
    {"model": "anthropic/claude-4"},
    {"model": "fireworks_ai/accounts/fireworks/models/qwen3-235b-a22b-instruct-2507"},
],
Note that completion_params are passed to LiteLLM, so some models require a provider prefix (for example anthropic/ or fireworks_ai/).

Change the Judge

# Available judges in JUDGE_CONFIGS:
judge_name = "kimi-k2-instruct-0905"  # Fireworks Kimi model (default)
judge_name = "gemini-2.5-pro"         # Google Gemini Pro
judge_name = "gpt-4.1"                # OpenAI GPT-4.1
judge_name = "gemini-2.5-flash"       # Google Gemini Flash (faster)
Each judge has optimized settings for temperature, token limits, and concurrency based on Arena-Hard-Auto research.
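The exact contents of JUDGE_CONFIGS live in llm_judge.py; as a purely hypothetical sketch of the kind of settings involved (field names here are illustrative, not the library's actual schema):
# Hypothetical shape only; see llm_judge.py for the real JUDGE_CONFIGS entries.
EXAMPLE_JUDGE_CONFIG = {
    "gemini-2.5-flash": {
        "temperature": 0.0,   # deterministic judging
        "max_tokens": 2048,   # cap on the judge's verdict length
        "concurrency": 8,     # parallel judge calls
    },
}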

Adjust Concurrency for Speed

You can make evaluations faster by increasing concurrency, but be careful of rate limits:
@evaluation_test(
    data_loaders=DynamicDataLoader(
        generators=[langfuse_data_generator],
    ),
    completion_params=[{"model": "gpt-4.1"}],
    rollout_processor=SingleTurnRolloutProcessor(),
    max_concurrent_rollouts=16,      # Increase for faster candidate model responses (from completion_params)
    max_concurrent_evaluations=4,    # Increase for faster judging (e.g. gemini-2.5-pro or kimi-k2-instruct-0905)
)
Setting concurrency too high may result in 429 rate limit errors. If you encounter rate limiting, reduce these values. For more troubleshooting, see Common Errors.

Extract More Test Cases with Preprocessing

Get more evaluation data from multi-turn conversations using built-in preprocessing functions:
from eval_protocol import multi_turn_assistant_to_ground_truth, assistant_to_ground_truth

@evaluation_test(
    data_loaders=DynamicDataLoader(
        generators=[langfuse_data_generator],
        preprocess_fn=multi_turn_assistant_to_ground_truth,  # Recommended: creates one test per assistant turn
        # preprocess_fn=assistant_to_ground_truth,          # Alternative: only uses the last assistant turn
    ),
    completion_params=[{"model": "gpt-4.1"}],
    rollout_processor=SingleTurnRolloutProcessor(),
)
multi_turn_assistant_to_ground_truth (recommended):
  • Splits each multi-turn trace into multiple evaluation rows
  • Each assistant response becomes ground truth for comparison
  • Gets more test cases from fewer traces
assistant_to_ground_truth:
  • Uses only the final assistant response as ground truth
  • Good for evaluating conversation conclusions
You can also write custom preprocessing functions to fit your specific use case.
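As a rough sketch, assuming a preprocess_fn receives and returns a list of EvaluationRow and that each row exposes its conversation as row.messages (adjust to the actual schema in your version), a custom function could drop trivially short conversations:
from eval_protocol import EvaluationRow


def drop_short_conversations(rows: list[EvaluationRow]) -> list[EvaluationRow]:
    # Keep only rows with at least two messages (a user turn plus a reply).
    return [row for row in rows if len(row.messages) >= 2]
You would then pass it exactly like the built-ins: preprocess_fn=drop_short_conversations inside DynamicDataLoader.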

Real-World Example: Validating Against Tau-Bench

For a deeper look at this methodology and our own validation of it, check out our blog post here! TODO: link blog.

You’ve just built your first model leaderboard!

Most importantly, no ground truth or manual labeling is required, just your existing conversation traces. Now you have objective data to choose the right model for your use case:
  • Test on your real conversations - See how models perform on your specific domain
  • Get statistical confidence - Win rates with confidence intervals, not gut feelings
  • Make cost-performance trade-offs - Balance quality against API pricing
  • Deploy strategically - Use different models where they excel most
Your leaderboard becomes a living tool that evolves with new models and use cases. Stop guessing. Start measuring.