This example demonstrates a multi-turn retail customer service evaluation using τ²-bench environments and MCP tool interactions. For a detailed walkthrough of the concepts behind this evaluation, see our Multi-Turn Evaluation with User Simulation tutorial.
The complete implementation lives in the Python SDK at tests/pytest/test_tau_bench_retail.py and is exported as tau_bench_retail.

What it does

  • Uses multi-turn conversations with MCP tool calling in a retail environment
  • Evaluates agents on both database state validation and communication quality
  • Applies multiplicative scoring where all criteria must pass for full credit (sketched after this list)
  • Runs simulated customer service scenarios with realistic tool interactions
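
To make the multiplicative scheme concrete, here is a minimal sketch, not the actual τ²-bench reward code; the criterion names are hypothetical, and each criterion is assumed to score 0.0 (fail) or 1.0 (pass), so a single failure zeroes the final reward.

def multiplicative_reward(criteria_scores: dict[str, float]) -> float:
    # Illustrative only: multiply per-criterion scores so one failure zeroes the total.
    reward = 1.0
    for score in criteria_scores.values():
        reward *= score
    return reward

# Database state is correct but the communication check fails -> no credit.
print(multiplicative_reward({"db_state": 1.0, "communication": 0.0}))  # 0.0
print(multiplicative_reward({"db_state": 1.0, "communication": 1.0}))  # 1.0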

How it’s configured

  • @evaluation_test uses MCPGymRolloutProcessor for multi-turn tool interactions (see the sketch after this list)
  • Retail dataset entries include evaluation criteria and user simulation contexts
  • The τ²-bench reward system validates environment state changes and communication quality
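
A rough sketch of what the test wiring might look like follows; the import locations, dataset path, model name, and decorator parameters are assumptions for illustration, so check tests/pytest/test_tau_bench_retail.py for the real configuration.

from eval_protocol.models import EvaluationRow
from eval_protocol.pytest import evaluation_test, MCPGymRolloutProcessor  # assumed import location

@evaluation_test(
    input_dataset=["tests/pytest/data/tau_bench_retail.jsonl"],  # assumed dataset path
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct"}],  # any chat model
    rollout_processor=MCPGymRolloutProcessor(),  # drives the multi-turn MCP tool calls
    num_runs=2,  # repeat rollouts to smooth out user-simulation stochasticity
    mode="pointwise",
)
def test_tau_bench_retail(row: EvaluationRow) -> EvaluationRow:
    # By this point the rollout processor has run the conversation; attach the
    # multiplicative reward computed from the tau²-bench criteria to the row.
    ...
    return row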

Run it locally

After installing eval-protocol, you can run the benchmark from anywhere:
pytest --pyargs eval_protocol.benchmarks.test_tau_bench_retail -v \
  --ep-print-summary --ep-summary-json artifacts/tau_bench_retail.json
Use --ep-max-rows=5 for a quick smoke test, or --ep-reasoning-effort=high for a more thorough pass over the stochastic multi-turn interactions.
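
The --ep-summary-json artifact is plain JSON, so you can inspect the results programmatically; this snippet simply loads and pretty-prints it without assuming a particular schema.

import json
from pathlib import Path

# Pretty-print whatever the run wrote to the summary artifact.
summary = json.loads(Path("artifacts/tau_bench_retail.json").read_text())
print(json.dumps(summary, indent=2))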

Notes

  • This evaluation involves multi-turn conversations with tool calling, making it computationally intensive
  • Multiple runs are recommended because multi-turn user simulation is stochastic (see the averaging sketch below)
  • The final score uses a multiplicative reward, so all evaluation criteria must pass for full credit
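
Because individual runs vary, a simple way to report a headline number is to average final scores across runs; this generic sketch uses made-up scores and is not part of the eval-protocol API.

from statistics import mean, stdev

# Hypothetical final scores from four repeated runs of the benchmark.
run_scores = [0.72, 0.65, 0.70, 0.68]
print(f"mean={mean(run_scores):.3f}  stdev={stdev(run_scores):.3f}")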