This example demonstrates a multi-turn retail customer service evaluation using τ²-bench environments and MCP tool interactions. For a detailed walkthrough of the concepts behind this evaluation, see our Multi-Turn Evaluation with User Simulation tutorial.
The complete implementation lives in the Python SDK at tests/pytest/test_tau_bench_retail.py and is exported as tau_bench_retail.

What it does

  • Uses multi-turn conversations with MCP tool calling in a retail environment
  • Evaluates agents on both database state validation and communication quality
  • Applies multiplicative scoring where all criteria must pass for full credit (sketched after this list)
  • Runs simulated customer service scenarios with realistic tool interactions
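
To make the multiplicative scheme concrete, here is a minimal sketch, not the actual τ²-bench reward code; the criterion names are hypothetical, and each criterion is assumed to score 0.0 (fail) or 1.0 (pass), so a single failure zeroes the final reward.

def multiplicative_reward(criteria_scores: dict[str, float]) -> float:
    # Illustrative only: multiply per-criterion scores so one failure zeroes the total.
    reward = 1.0
    for score in criteria_scores.values():
        reward *= score
    return reward

# Database state is correct but the communication check fails -> no credit.
print(multiplicative_reward({"db_state": 1.0, "communication": 0.0}))  # 0.0
print(multiplicative_reward({"db_state": 1.0, "communication": 1.0}))  # 1.0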

How it’s configured

  • @evaluation_test uses MCPGymRolloutProcessor for multi-turn tool interactions (see the sketch after this list)
  • Retail dataset entries include evaluation criteria and user simulation contexts
  • The τ²-bench reward system validates environment state changes and communication quality
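
A rough sketch of what the test wiring might look like follows; the import locations, dataset path, model name, and decorator parameters are assumptions for illustration, so check tests/pytest/test_tau_bench_retail.py for the real configuration.

from eval_protocol.models import EvaluationRow
from eval_protocol.pytest import evaluation_test, MCPGymRolloutProcessor  # assumed import location

@evaluation_test(
    input_dataset=["tests/pytest/data/tau_bench_retail.jsonl"],  # assumed dataset path
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct"}],  # any chat model
    rollout_processor=MCPGymRolloutProcessor(),  # drives the multi-turn MCP tool calls
    num_runs=2,  # repeat rollouts to smooth out user-simulation stochasticity
    mode="pointwise",
)
def test_tau_bench_retail(row: EvaluationRow) -> EvaluationRow:
    # By this point the rollout processor has run the conversation; attach the
    # multiplicative reward computed from the tau²-bench criteria to the row.
    ...
    return row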

Run it locally

After installing eval-protocol, you can run the benchmark from anywhere:
pytest --pyargs eval_protocol.benchmarks.test_tau_bench_retail -v \
  --ep-print-summary --ep-summary-json artifacts/tau_bench_retail.json
Use --ep-max-rows=5 for a quick smoke test, or --ep-reasoning-effort=high for a more thorough pass over the stochastic multi-turn interactions.
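
The --ep-summary-json artifact is plain JSON, so you can inspect the results programmatically; this snippet simply loads and pretty-prints it without assuming a particular schema.

import json
from pathlib import Path

# Pretty-print whatever the run wrote to the summary artifact.
summary = json.loads(Path("artifacts/tau_bench_retail.json").read_text())
print(json.dumps(summary, indent=2))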

Notes

  • This evaluation involves multi-turn conversations with tool calling, making it computationally intensive
  • Multiple runs are recommended because multi-turn user simulation is stochastic (see the averaging sketch below)
  • The final score uses a multiplicative reward, so all evaluation criteria must pass for full credit
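
Because individual runs vary, a simple way to report a headline number is to average final scores across runs; this generic sketch uses made-up scores and is not part of the eval-protocol API.

from statistics import mean, stdev

# Hypothetical final scores from four repeated runs of the benchmark.
run_scores = [0.72, 0.65, 0.70, 0.68]
print(f"mean={mean(run_scores):.3f}  stdev={stdev(run_scores):.3f}")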