The open standard and toolkit for LLM evaluations

EP is an open specification with a Python SDK, a UI for reviewing evals, popular benchmarks, and integrations with observability and agent tooling. It gives you a consistent way to write evals, store traces, and save results. Start with simple single-turn evals for model selection and prompt engineering, then scale up to complex multi-turn reinforcement learning (RL) for agents using the Model Context Protocol (MCP). These consistent patterns let you build sophisticated agent evaluations that work across real-world scenarios, from markdown generation tasks to customer service agents with tool-calling capabilities.
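As a rough sketch of the single-turn workflow, the pytest-style eval below scores whether a completion uses markdown bold. The names here (`evaluation_test`, `SingleTurnRolloutProcessor`, `EvaluationRow`, `Message`, `EvaluateResult`) and the parameter shapes follow the eval-protocol quickstart pattern, but treat the exact signatures as assumptions and verify them against the current docs.

```python
# A hedged sketch of a single-turn EP eval; parameter nesting and class
# names are assumptions based on the quickstart, not a definitive API.
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test


@evaluation_test(
    input_messages=[[[Message(role="user", content="Summarize EP in one sentence, using **bold** for key terms.")]]],
    completion_params=[{"model": "fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct"}],
    rollout_processor=SingleTurnRolloutProcessor(),
)
def test_markdown_bold(row: EvaluationRow) -> EvaluationRow:
    # Score 1.0 if the assistant's reply contains bold markdown, else 0.0.
    reply = row.messages[-1].content or ""
    score = 1.0 if "**" in reply else 0.0
    row.evaluation_result = EvaluateResult(
        score=score,
        reason="contains bold markdown" if score else "no bold markdown",
    )
    return row
```

Running this with pytest rolls out the prompt against the configured model and records the score alongside the trace, so the same test scales from a quick model comparison to a reusable benchmark.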
The examples use Fireworks models (model names prefixed with `fireworks_ai/`), so you need to set the `FIREWORKS_API_KEY` environment variable by creating a `.env` file in the root of your project.
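If your test runner doesn't pick up `.env` automatically, a small snippet using python-dotenv (an assumption; EP may load it for you) makes the key available to the process:

```python
# .env at the project root (placeholder value):
# FIREWORKS_API_KEY=your-api-key-here
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env into the process environment
assert os.environ.get("FIREWORKS_API_KEY"), "FIREWORKS_API_KEY is not set"
```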