- **Action Space**: `Discrete(4)` - Move left (0), down (1), right (2), up (3)
- **Observation Space**: `Discrete(16)` - Grid positions 0-15
- **Grid Tiles**: `S` (Start), `F` (Frozen/safe), `H` (Hole/lose), `G` (Goal/win)
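For reference, these spaces can be inspected directly on the underlying Gymnasium environment. This snippet uses only the standard `gymnasium` package and its `FrozenLake-v1` registration:

```python
import gymnasium as gym

# Standard 4x4 Frozen Lake with deterministic movement.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

print(env.action_space)       # Discrete(4): 0=left, 1=down, 2=right, 3=up
print(env.observation_space)  # Discrete(16): one state per grid cell

obs, info = env.reset(seed=42)
obs, reward, terminated, truncated, info = env.step(2)  # move right
```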
Each dataset entry contains:

- `id`: Unique identifier for the evaluation run
- `system_prompt`: Detailed instructions explaining the game rules and interaction method
- `user_prompt_template`: Template for presenting the current game state to the agent; `{observation}` is replaced with the current grid state
- `environment_context`: Configuration parameters for the Frozen Lake environment
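A concrete entry might look like the following sketch. The field names come from the list above; all values are purely illustrative:

```python
# Illustrative dataset entry; values are hypothetical, field names are real.
entry = {
    "id": "frozen_lake_run_0",
    "system_prompt": (
        "You are playing Frozen Lake on a 4x4 grid. Tiles: S=start, F=frozen "
        "(safe), H=hole (you lose), G=goal (you win). Move with the lake_move tool."
    ),
    "user_prompt_template": "Current grid:\n{observation}\nChoose your next move.",
    "environment_context": {"game": "FrozenLake", "map_name": "4x4", "seed": 42},
}
```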
The environment implementation is built around two classes: `FrozenLakeMcp` and `FrozenLakeAdapter`.
The `FrozenLakeMcp` class inherits from `McpGym` and creates an MCP server that agents can interact with.
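A minimal sketch of what such a subclass can look like, assuming `McpGym` exposes a FastMCP-style server as `self.mcp` and a tool-registration hook; the import path, hook name, and helper names here are assumptions, not the framework's confirmed API:

```python
from eval_protocol.mcp import McpGym  # import path is an assumption

class FrozenLakeMcp(McpGym):
    """MCP server wrapping the Frozen Lake environment (sketch)."""

    def __init__(self, seed: int | None = None):
        # Assumes McpGym wires an environment adapter to the MCP server.
        super().__init__("FrozenLake-v1", FrozenLakeAdapter(), seed)

    def _register_tools(self) -> None:
        # Hypothetical hook; the real class may register tools differently.
        @self.mcp.tool(
            name="lake_move",
            description="Move on the lake: LEFT, DOWN, RIGHT, or UP.",
        )
        def lake_move(action: str) -> dict:
            # Translate the string action into an environment step.
            return self._step_environment(action)  # helper name is assumed
```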
The server's central piece is the `lake_move` tool, which accepts simple string actions. The `FrozenLakeAdapter` handles the actual Gymnasium environment operations.
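A sketch of the adapter's role as a thin wrapper around `gymnasium`; the method names and config keys are assumptions, while the Gymnasium calls themselves are the real API:

```python
import gymnasium as gym

class FrozenLakeAdapter:
    """Thin wrapper around the Gymnasium environment (sketch)."""

    ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]  # maps to Discrete(4) indices

    def create_environment(self, config: dict | None = None):
        config = config or {}
        return gym.make(
            "FrozenLake-v1",
            map_name=config.get("map_name", "4x4"),
            is_slippery=config.get("is_slippery", False),
        )

    def reset_environment(self, env, seed: int | None = None):
        obs, info = env.reset(seed=seed)
        return obs, info

    def step_environment(self, env, action: str):
        # Convert "LEFT"/"DOWN"/"RIGHT"/"UP" into the integer the env expects.
        action_idx = self.ACTION_NAMES.index(action.upper())
        return env.step(action_idx)
```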
The `frozen_lake_to_evaluation_row` function converts the simple Frozen Lake dataset entries into the framework's `EvaluationRow` format.
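A sketch of such a conversion, assuming `EvaluationRow` accepts an initial `messages` list plus per-row metadata; the constructor fields and import path are assumptions:

```python
from eval_protocol.models import EvaluationRow, Message  # import path assumed

def frozen_lake_to_evaluation_row(rows: list[dict]) -> list[EvaluationRow]:
    """Convert raw dataset entries into EvaluationRow objects (sketch)."""
    return [
        EvaluationRow(
            messages=[Message(role="system", content=row["system_prompt"])],
            input_metadata={  # metadata field names are assumptions
                "row_id": row["id"],
                "user_prompt_template": row["user_prompt_template"],
                "environment_context": row["environment_context"],
            },
        )
        for row in rows
    ]
```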
The `@evaluation_test` decorator configures the RL evaluation with Frozen Lake-specific parameters.
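A hedged sketch of how the decorated test might be wired up. Apart from `default_mcp_gym_rollout_processor` and the names already introduced above, the parameter names, import paths, dataset path, model, and threshold are all assumptions:

```python
from eval_protocol.pytest import (  # import path is an assumption
    evaluation_test,
    default_mcp_gym_rollout_processor,
)

@evaluation_test(
    input_dataset=["tests/data/frozen_lake_dataset.jsonl"],  # illustrative path
    dataset_adapter=frozen_lake_to_evaluation_row,
    completion_params=[{"model": "gpt-4o-mini"}],            # illustrative model
    rollout_processor=default_mcp_gym_rollout_processor,
    passed_threshold=0.66,                                   # illustrative value
    num_runs=1,
)
async def test_frozen_lake_evaluation(row: EvaluationRow) -> EvaluationRow:
    # The rollout processor has already run the episode; the row would be
    # scored from its accumulated reward (accessor name varies, so omitted).
    return row
```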
The `default_mcp_gym_rollout_processor` is the same processor used in the τ²-bench evaluation, demonstrating how eval-protocol provides reusable components that work across different evaluation types, from conversational agents to RL environments.
Each rollout plays out as a sequence of `lake_move` tool calls, with the environment's per-step rewards accumulating into a `total_reward` for the episode. Unlike the `test_markdown_highlighting_evaluation` introduced earlier, this multi-turn example with per-step rewards showcases the framework's full capabilities. Specifically, it demonstrates how Eval Protocol generates detailed rollout data enriched with reward signals, which can directly inform reinforcement learning and fine-tuning processes.
Per-step rewards recorded throughout each Frozen Lake episode are not merely for assessment; they form structured training data. The protocol aggregates these step-by-step rewards (assigning 0.0 for each frozen tile encountered and +1.0 for successfully reaching the goal) into trajectory-level scores. This nuanced scoring provides sophisticated training signals: reward sequences can be directly leveraged by training algorithms like PPO or GRPO, or any learning method that benefits from structured, sequential feedback.
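To make the aggregation concrete, here is a minimal sketch of how a trajectory's step rewards roll up into a scalar score. The reward values follow the scheme described above; the discounted variant is illustrative, not the framework's stated aggregation rule:

```python
# Per-step rewards for one Frozen Lake episode:
# 0.0 for each frozen tile encountered, +1.0 on reaching the goal.
step_rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

# Trajectory-level score as an undiscounted sum.
total_reward = sum(step_rewards)  # 1.0

# Discounted return, the form PPO/GRPO-style trainers typically consume.
gamma = 0.99
discounted_return = sum(r * gamma**t for t, r in enumerate(step_rewards))
```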
Eval Protocol thus transforms an evaluation suite from a passive testing mechanism into an active engine for dynamic data generation, facilitating every stage of the LLM software development lifecycle—from model selection and prompt refinement to ongoing evaluation, debugging, and continuous improvement. Its vision is straightforward: define evaluation criteria once in code and reuse them universally—for benchmarking, CI/CD processes, dataset creation, and iterative training. The Frozen Lake tutorial exemplifies how a unified evaluation framework can bridge traditional reinforcement learning environments with contemporary LLM-driven agents, laying the groundwork for continuously improving AI systems.