id
: Unique identifier for the task scenariouser_prompt_template
: Template for presenting information to the simulated userenvironment_context
: Domain specification and environmental settingsuser_simulation
: Complete definition of the simulated userβs behavior, knowledge, and personalityevaluation_criteria
: Specific actions, communications, and assertions the agent must fulfill"actions"
array with the correct parameters
"communicate_info"
(e.g., the number β4β for suitcase allowance)
"nl_assertions"
- complex behavioral requirements like βAgent detects that user is actually a Silver memberβ and proper policy application
AirlineDomainMcp
server that handles MCP tool calls, and the AirlineEnvironment
that manages the actual airline business logic.
AirlineDomainMcp
class inherits from McpGym
and configures the airline domain:
get_reservation_details
or get_user_details
. We expose the same tools that the original πΒ²-benchmark uses.
Example Tool Definition:
AirlineEnvironment
class manages the actual airline business logic and database operations. It loads a flight database and provides methods for booking, cancellation, user management, and other airline operations:
tau_bench_airline_to_evaluation_row
function converts raw πΒ²-bench dataset entries into the eval-protocolβs EvaluationRow
format:
input_metadata.dataset_info
@evaluation_test
decorator configures the evaluation with πΒ²-bench-specific parameters:
rollout_processor=default_mcp_gym_rollout_processor
: Uses a default MCP Gym processor for multi-turn conversations with simulated users, reusable for any evaluation benchmark that uses the same MCP Gym architectureserver_script_path="examples/tau2_mcp/server.py"
: Points to the MCP server that hosts the airline environmentpassed_threshold=0.4
: Threshold of 40% must be achieved for this test to passmax_concurrent_rollouts=32
: High concurrency for efficient evaluation of multiple scenarios