You can find the complete code for this example at test_tau_bench_airline.py.
Understanding πΒ²-bench
Whatβs uniquely challenging and useful about the πΒ²-benchmark is the use of a simulated user alongside the agent being evaluated. This setup has a few key components:- Agent: The AI system being evaluated, which must follow domain-specific policies and use available MCP tools to interact with the environment
- Simulated User: An AI-powered user that generates realistic customer requests, responses, and conversational behavior
- Environment: A simulated business system (airline, retail, telecom) that the agent interacts with through tool calls
Understanding the Airline Dataset
πΒ²-bench includes multiple business domains, each with distinct characteristics. This example focuses specifically on the airline domain, which provides the richest scenarios for demonstrating simulated user interactions. Other domains include retail, mock, and telecom.Dataset Format
Each entry in the airline dataset contains:id
: Unique identifier for the task scenariouser_prompt_template
: Template for presenting information to the simulated userenvironment_context
: Domain specification and environmental settingsuser_simulation
: Complete definition of the simulated userβs behavior, knowledge, and personalityevaluation_criteria
: Specific actions, communications, and assertions the agent must fulfill
Example Dataset Entry
Baggage Allowance Inquiry Scenario:Evaluation Criteria
The airline domain uses four distinct evaluation criteria to comprehensively assess agent performance:-
Tool Action Verification: Checks if the agent calls the specific tool actions listed in the
"actions"
array with the correct parameters -
Communication Validation: Verifies that the agent communicated to the simulated user strictly what is specified in
"communicate_info"
(e.g., the number β4β for suitcase allowance) -
Natural Language Assertions: Uses LLM-as-a-judge to evaluate the
"nl_assertions"
- complex behavioral requirements like βAgent detects that user is actually a Silver memberβ and proper policy application - Database State Verification: Creates a hash over the database to ensure it remains in the correct state after all interactions, validating that no unintended changes occurred during the conversation
Test Harness Architecture (MCP Gym + Environment)
The πΒ²-bench airline evaluation uses the MCP Gym framework to create realistic business simulations. The implementation consists of two main components: theAirlineDomainMcp
server that handles MCP tool calls, and the AirlineEnvironment
that manages the actual airline business logic.
Airline Domain MCP Server
TheAirlineDomainMcp
class inherits from McpGym
and configures the airline domain:
get_reservation_details
or get_user_details
. We expose the same tools that the original πΒ²-benchmark uses.
Example Tool Definition:
Airline Business Environment
TheAirlineEnvironment
class manages the actual airline business logic and database operations. It loads a flight database and provides methods for booking, cancellation, user management, and other airline operations:
Pytest Implementation
Finally, we also integrate the πΒ²-bench airline evaluation with the eval-protocol pytest framework through a test function that orchestrates the simulated user, MCP environment, and multi-dimensional evaluation criteria.Step 1: Dataset Adapter
Thetau_bench_airline_to_evaluation_row
function converts raw πΒ²-bench dataset entries into the eval-protocolβs EvaluationRow
format:
- System Prompt Loading: Reads domain-specific agent instructions from external files
- Metadata Preservation: Stores all πΒ²-bench-specific data in
input_metadata.dataset_info
- Initial System Message: Sets up the conversation with the agentβs role and instructions
Step 2: Test Configuration
The@evaluation_test
decorator configures the evaluation with πΒ²-bench-specific parameters:
rollout_processor=default_mcp_gym_rollout_processor
: Uses a default MCP Gym processor for multi-turn conversations with simulated users, reusable for any evaluation benchmark that uses the same MCP Gym architectureserver_script_path="examples/tau2_mcp/server.py"
: Points to the MCP server that hosts the airline environmentpassed_threshold=0.4
: Threshold of 40% must be achieved for this test to passmax_concurrent_rollouts=32
: High concurrency for efficient evaluation of multiple scenarios