No hand-written evals required. Build a data-driven model leaderboard from your existing traces.
With so many AI models available, choosing the right one is becoming a critical engineering decision. This quickstart shows you how to evaluate your production LLM traces using pairwise comparisons in under 5 minutes - no ground truth required!
Ready to dive in? Install EP with your preferred observability platform:
```bash
pip install 'eval-protocol[langfuse]'

# Model API keys (choose what you need)
export OPENAI_API_KEY="your_openai_key"
export FIREWORKS_API_KEY="your_fireworks_key"
export GEMINI_API_KEY="your_gemini_key"

# Platform keys
export LANGFUSE_PUBLIC_KEY="your_public_key"
export LANGFUSE_SECRET_KEY="your_secret_key"
export LANGFUSE_HOST="https://your-deployment.com"  # Optional
```
We provide example implementations you can run immediately; the only thing you need to change is the parameters passed to `adapter.get_evaluation_rows`, as sketched below. See the Integrations section for more information on your choice of tracing platform.
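For orientation, here is a minimal sketch of how the data generator in the example calls the adapter. Every keyword argument below is an assumption for illustration only; the adapter object and the exact filter parameters it accepts are defined in the example for your platform, so check the Integrations section for the real signature.

```python
# Minimal sketch: the data generator pulls traces through the platform adapter.
# `adapter` is constructed by the example implementation; all keyword arguments
# below are illustrative, not the adapter's actual signature.
from datetime import datetime, timedelta

def langfuse_data_generator():
    return adapter.get_evaluation_rows(
        limit=50,                                            # assumed: cap on how many traces to pull
        from_timestamp=datetime.now() - timedelta(days=7),   # assumed: restrict to recent traffic
        tags=["production"],                                 # assumed: filter to the traces you care about
    )
```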
```python
# Modify the completion_params in your @evaluation_test:
completion_params=[
    {"model": "gpt-5"},
    {"model": "anthropic/claude-4"},
    {"model": "fireworks_ai/accounts/fireworks/models/qwen3-235b-a22b-instruct-2507"},
],
```
Note that the models in completion_params are called through LiteLLM, so some need a provider prefix in the model name, for example:
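The snippet below shows typical LiteLLM naming conventions. The specific model names are illustrative; check LiteLLM's provider documentation for the models you actually want to compare.

```python
# Typical LiteLLM provider prefixes (model names are illustrative):
completion_params = [
    {"model": "gpt-4.1"},                # OpenAI models are referenced by name alone
    {"model": "anthropic/claude-4"},     # Anthropic models take the "anthropic/" prefix
    {"model": "gemini/gemini-2.5-pro"},  # Gemini models via Google AI Studio take "gemini/"
    {"model": "fireworks_ai/accounts/fireworks/models/qwen3-235b-a22b-instruct-2507"},  # Fireworks uses "fireworks_ai/" plus the full model path
]
```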
```python
# Available judges in JUDGE_CONFIGS:
judge_name = "kimi-k2-instruct-0905"  # Fireworks Kimi model (default)
judge_name = "gemini-2.5-pro"         # Google Gemini Pro
judge_name = "gpt-4.1"                # OpenAI GPT-4.1
judge_name = "gemini-2.5-flash"       # Google Gemini Flash (faster)
```
Each judge has optimized settings for temperature, token limits, and concurrency based on Arena-Hard-Auto research.
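As a rough mental model, a judge entry bundles the judge model with its generation and concurrency settings. The field names and values below are assumptions for illustration, not the library's actual JUDGE_CONFIGS schema; inspect the example code for the real entries.

```python
# Illustrative shape of a judge configuration (field names and values are assumptions,
# not the actual JUDGE_CONFIGS schema):
JUDGE_CONFIGS = {
    "gemini-2.5-pro": {
        "model": "gemini/gemini-2.5-pro",  # judge model, routed through LiteLLM
        "temperature": 0.0,                # low temperature for consistent verdicts
        "max_tokens": 4096,                # room for the judge's reasoning and verdict
        "max_concurrency": 4,              # parallel judging requests
    },
}
```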
You can make evaluations faster by increasing concurrency, but be careful of rate limits:
```python
@evaluation_test(
    data_loaders=DynamicDataLoader(
        generators=[langfuse_data_generator],
    ),
    completion_params=[{"model": "gpt-4.1"}],
    rollout_processor=SingleTurnRolloutProcessor(),
    max_concurrent_rollouts=16,    # Increase for faster candidate model responses (from completion_params)
    max_concurrent_evaluations=4,  # Increase for faster judging (e.g. gemini-2.5-pro or kimi-k2-instruct-0905)
)
```
Setting concurrency too high may result in 429 rate limit errors. If you encounter rate limiting, reduce these values. For more troubleshooting, see Common Errors.
Get more evaluation data from multi-turn conversations using built-in preprocessing functions:
```python
from eval_protocol import multi_turn_assistant_to_ground_truth, assistant_to_ground_truth

@evaluation_test(
    data_loaders=DynamicDataLoader(
        generators=[langfuse_data_generator],
        preprocess_fn=multi_turn_assistant_to_ground_truth,  # Recommended: creates one test per assistant turn
        # preprocess_fn=assistant_to_ground_truth,           # Alternative: only uses the last assistant turn
    ),
    completion_params=[{"model": "gpt-4.1"}],
    rollout_processor=SingleTurnRolloutProcessor(),
)
```
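Conceptually, the multi-turn preprocessor turns each conversation into one evaluation row per assistant turn, using the messages before that turn as the input and the original assistant reply as the reference answer. The sketch below illustrates the idea on plain message dicts; it is not the library's implementation.

```python
# Conceptual sketch (not the library's implementation): split one multi-turn trace
# into one evaluation row per assistant message.
def split_by_assistant_turns(messages: list[dict]) -> list[dict]:
    rows = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            rows.append({
                "input_messages": messages[:i],  # everything the model saw before this turn
                "ground_truth": msg["content"],  # the original assistant reply to compare against
            })
    return rows

conversation = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "And its population?"},
    {"role": "assistant", "content": "About 2.1 million in the city proper."},
]
# Two assistant turns yield two rows; assistant_to_ground_truth would keep only the last.
print(len(split_by_assistant_turns(conversation)))  # -> 2
```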
Most importantly, no ground truth or manual labeling is required, just your existing conversation traces.

Now you have objective data to choose the right model for your use case:
- **Test on your real conversations** - See how models perform on your specific domain
- **Get statistical confidence** - Win rates with confidence intervals, not gut feelings
- **Make cost-performance trade-offs** - Balance quality against API pricing
- **Deploy strategically** - Use different models where they excel most
Your leaderboard becomes a living tool that evolves with new models and use cases. Stop guessing. Start measuring.