You can find the complete code for this example at test_frozen_lake.py.
Understanding the Frozen Lake Environment
Frozen Lake is a classic RL environment where an agent navigates a 4x4 grid from start to goal without falling into holes.

- Action Space: Discrete(4) - Move left (0), down (1), right (2), up (3)
- Observation Space: Discrete(16) - Grid positions 0-15
- Grid Layout: 4x4 grid with S (Start), F (Frozen/safe), H (Hole/lose), G (Goal/win)
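The layout and spaces above can be sketched in plain Python. This is an illustrative stand-in, not the Gymnasium API; the map shown is the standard 4x4 Frozen Lake layout, and `position_to_cell` is a hypothetical helper for mapping a Discrete(16) observation back to its tile.

```python
# Standard 4x4 Frozen Lake map: S=start, F=frozen/safe, H=hole, G=goal.
GRID = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

# Discrete(4) action space: integer -> direction name.
ACTIONS = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}

def position_to_cell(position: int) -> str:
    """Map a Discrete(16) observation (0-15) to its grid tile."""
    row, col = divmod(position, 4)
    return GRID[row][col]

print(position_to_cell(0))   # start tile "S"
print(position_to_cell(15))  # goal tile "G"
```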
Understanding the Dataset Structure
The Frozen Lake dataset is much simpler than conversational agent datasets: it focuses purely on setting up the RL environment and providing clear instructions for agent interaction.

Dataset Format

Each entry contains four main components for configuring the RL episode:

- id: Unique identifier for the evaluation run
- system_prompt: Detailed instructions explaining the game rules and interaction method
- user_prompt_template: Template for presenting the current game state to the agent; {observation} gets replaced with the current grid state
- environment_context: Configuration parameters for the Frozen Lake environment
Example Dataset Entry
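A hypothetical entry following the four fields described above, written as a Python dict. The specific values (the id, prompt wording, and environment_context keys) are illustrative assumptions, not the shipped dataset.

```python
# Illustrative dataset entry; field names follow the format described above,
# but the concrete values are assumptions for demonstration.
entry = {
    "id": "frozen_lake_eval_0",
    "system_prompt": (
        "You are playing Frozen Lake on a 4x4 grid. Reach G from S without "
        "stepping on H. Call the lake_move tool with LEFT, DOWN, RIGHT, or UP."
    ),
    "user_prompt_template": "Current grid state:\n{observation}\nChoose your next move.",
    "environment_context": {"game": "FrozenLake", "map_name": "4x4"},
}

# The {observation} placeholder is filled with the rendered grid each turn:
prompt = entry["user_prompt_template"].format(observation="SFFF\nFHFH\nFFFH\nHFFG")
```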
Test Harness Architecture (RL Gym + Environment Integration)
Now we can explain the adapter pattern mentioned earlier: the eval-protocol framework provides a clean bridge between standard Gymnasium environments and the MCP evaluation system through two key components, FrozenLakeMcp and FrozenLakeAdapter.
MCP Server: FrozenLakeMcp
The FrozenLakeMcp class inherits from McpGym and creates an MCP server that agents can interact with:

- Single Tool Interface: Agents interact through the lake_move tool with simple string actions
- Session Management: Each evaluation gets isolated environment sessions
- Action Validation: Converts string actions (LEFT, DOWN, RIGHT, UP) to environment integers
- Data Plane: Returns only observation data; control plane (rewards, termination) managed server-side
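The action-validation and data-plane behavior can be sketched in miniature. This is not the real McpGym API; the class and method below are hypothetical stand-ins showing the key idea that the lake_move tool validates string actions and returns only observation data, keeping rewards server-side.

```python
# Hedged sketch (not the real McpGym API) of the lake_move tool's behavior.
ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]  # maps to env integers 0-3

class FrozenLakeSessionSketch:
    def __init__(self):
        self.position = 0    # Discrete(16) observation
        self._reward = 0.0   # control plane: tracked server-side, never sent to the agent

    def lake_move(self, action: str) -> dict:
        """Validate a string action and return only data-plane information."""
        if action not in ACTION_NAMES:
            raise ValueError(f"Invalid action {action!r}; expected one of {ACTION_NAMES}")
        action_int = ACTION_NAMES.index(action)
        # ... the environment step would happen here using action_int ...
        return {"observation": self.position, "action_taken": action}

session = FrozenLakeSessionSketch()
result = session.lake_move("DOWN")
```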
Environment Adapter: FrozenLakeAdapter
The FrozenLakeAdapter handles the actual Gymnasium environment operations:
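The adapter's role can be illustrated with a minimal sketch. The real FrozenLakeAdapter wraps an actual Gymnasium environment; here a trivial fake environment stands in so the sketch runs anywhere, and the class/method names are illustrative assumptions.

```python
# Hedged sketch of the adapter pattern: translating MCP string actions into
# Gym-style step/reset calls. FakeGymEnv is a stand-in with a placeholder
# transition, not gymnasium itself.

class FakeGymEnv:
    def reset(self, seed=None):
        return 0, {}                       # observation, info

    def step(self, action: int):
        obs = action + 1                   # placeholder transition for the sketch
        return obs, 0.0, False, False, {}  # obs, reward, terminated, truncated, info

class FrozenLakeAdapterSketch:
    ACTIONS = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}

    def __init__(self, env):
        self.env = env

    def reset(self, seed=None):
        obs, _ = self.env.reset(seed=seed)
        return obs

    def step(self, action_name: str):
        obs, reward, terminated, truncated, _ = self.env.step(self.ACTIONS[action_name])
        return obs, reward, terminated or truncated

adapter = FrozenLakeAdapterSketch(FakeGymEnv())
obs = adapter.reset(seed=42)
```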
Bridging Standard Gym with MCP
This architecture bridges two different paradigms:

Standard Gymnasium:
- Integer action spaces (0, 1, 2, 3)
- Numeric observations (position 0-15)
- Direct step/reset methods
- Per-step rewards and termination flags

MCP Gym:
- String-based tool calls ("LEFT", "DOWN", etc.)
- JSON-formatted observations with grid rendering
- Session-based interactions
- Server-managed control plane (rewards handled separately)
Session Isolation and Multi-Evaluation
The framework provides robust session management.

Pytest Implementation
The Frozen Lake evaluation integrates with the eval-protocol pytest framework through a streamlined test function that leverages the MCP Gym infrastructure and per-step rewards we've discussed.

Step 1: Dataset Adapter
The frozen_lake_to_evaluation_row function converts the simple Frozen Lake dataset entries into the framework's EvaluationRow format:
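The shape of that conversion can be sketched with plain dicts. The real function returns eval-protocol's EvaluationRow/Message types; the structure below (which fields land where) is an illustrative assumption, not the library's exact schema.

```python
# Hedged sketch of a dataset-entry-to-row conversion using plain dicts
# in place of eval-protocol's EvaluationRow/Message types.

def frozen_lake_to_evaluation_row_sketch(entry: dict) -> dict:
    return {
        # The system prompt seeds the conversation with the game rules.
        "messages": [{"role": "system", "content": entry["system_prompt"]}],
        # Environment configuration rides along as metadata for the rollout.
        "input_metadata": {
            "row_id": entry["id"],
            "session_data": {
                "environment_context": entry["environment_context"],
                "user_prompt_template": entry["user_prompt_template"],
            },
        },
    }

row = frozen_lake_to_evaluation_row_sketch({
    "id": "frozen_lake_eval_0",
    "system_prompt": "Reach the goal without falling into holes.",
    "user_prompt_template": "Current state:\n{observation}",
    "environment_context": {"game": "FrozenLake"},
})
```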
Step 2: Test Configuration
The @evaluation_test decorator configures the RL evaluation with Frozen Lake-specific parameters:
default_mcp_gym_rollout_processor is the same processor used in the τ²-bench evaluation, demonstrating how eval-protocol provides reusable components that work seamlessly across different evaluation types—from conversational agents to RL environments.
Step 3: Trajectory Evaluation Function
The test function demonstrates the power of per-step reward evaluation:

- Binary Success Evaluation: Unlike complex conversational evaluations, this is simple: either the agent reached the goal (score=1.0) or it didn't (score=0.0)
- Intrinsic Environment Rewards: The evaluation function doesn't need to implement complex scoring; it just uses the environment's intrinsic reward structure that was captured during the MCP Gym rollout
- Trajectory-Level Assessment: The framework automatically handles the multi-turn interaction, reward aggregation, and trajectory completion, so the evaluation function only needs to interpret the final aggregated score
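The three points above reduce to a few lines of logic. This is a sketch of the scoring idea, not eval-protocol's test function: per-step rewards (0.0 for frozen tiles, +1.0 on reaching the goal) are summed into a trajectory total, and the binary score just checks whether the goal reward was ever earned.

```python
# Hedged sketch of binary trajectory scoring from intrinsic per-step rewards.

def evaluate_trajectory_sketch(step_rewards: list[float]) -> float:
    """Score 1.0 if the episode earned the goal reward, else 0.0."""
    total_reward = sum(step_rewards)   # framework-style aggregation
    return 1.0 if total_reward > 0 else 0.0

# Hypothetical successful episode: five safe moves, then the goal.
success_score = evaluate_trajectory_sketch([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
# Hypothetical failed episode: the agent fell into a hole (no goal reward).
failure_score = evaluate_trajectory_sketch([0.0, 0.0, 0.0])
```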
Integration with MCP Gym Framework
This demonstrates the complete integration flow:

- Dataset Entry: Specifies environment configuration and agent instructions
- MCP Server Launch: Framework starts the FrozenLakeMcp server automatically
- Multi-Turn Rollout: Agent interacts with the environment through lake_move tool calls
- Per-Step Reward Capture: Framework records 0.0 or +1.0 at each step
- Trajectory Aggregation: Framework sums all per-step rewards into total_reward
- Simple Evaluation: Test function interprets the aggregated score
Conclusion
This showcases how eval-protocol transforms complex multi-turn RL environments into simple, reusable evaluation functions while maintaining the rich per-step reward information needed for training data generation. But more than that, this Frozen Lake tutorial illustrates a fundamental principle of Eval Protocol: building essential feedback loops for modern AI development. While initial evaluations might be as straightforward as the test_markdown_highlighting_evaluation introduced earlier, this multi-turn example with per-step rewards showcases the framework's full capabilities. Specifically, it demonstrates how Eval Protocol generates detailed rollout data enriched with reward signals, which can directly inform reinforcement learning and fine-tuning processes.
Per-step rewards recorded throughout each Frozen Lake episode are not merely for assessment; they form structured training data. The protocol aggregates these step-by-step rewards (assigning 0.0 for each frozen tile encountered and +1.0 for successfully reaching the goal) into trajectory-level scores. This nuanced scoring provides sophisticated training signals: reward sequences can be directly leveraged by training algorithms like PPO or GRPO, or any learning method that benefits from structured, sequential feedback.
Eval Protocol thus transforms an evaluation suite from a passive testing mechanism into an active engine for dynamic data generation, facilitating every stage of the LLM software development lifecycle—from model selection and prompt refinement to ongoing evaluation, debugging, and continuous improvement. Its vision is straightforward: define evaluation criteria once in code and reuse them universally—for benchmarking, CI/CD processes, dataset creation, and iterative training. The Frozen Lake tutorial exemplifies how a unified evaluation framework can bridge traditional reinforcement learning environments with contemporary LLM-driven agents, laying the groundwork for continuously improving AI systems.
