Multi-turn retail environment evaluation with MCP tool interactions and comprehensive reward scoring
The evaluation is implemented in tests/pytest/test_tau_bench_retail.py and exported as tau_bench_retail. The @evaluation_test uses MCPGymRolloutProcessor for multi-turn tool interactions. Use --ep-max-rows=5 for quick testing, or --ep-reasoning-effort=high for more thorough evaluation of the stochastic multi-turn interactions.
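
As a rough illustration of how the two flags might be applied, the sketch below builds a pytest command line for either a quick or a more thorough run. The test path and both flags are taken from the text above; the helper name build_pytest_command, its thorough parameter, and the assumption that the flags are passed directly on the pytest command line are hypothetical and not confirmed by the source.

```python
import sys

# Path of the evaluation test, as given in the text above.
TEST_FILE = "tests/pytest/test_tau_bench_retail.py"


def build_pytest_command(thorough: bool = False) -> list[str]:
    """Return an argv list for running the retail evaluation through pytest.

    Hypothetical helper: both flags are quoted from the text, but passing
    them straight to pytest is an assumption of this sketch.
    """
    flag = "--ep-reasoning-effort=high" if thorough else "--ep-max-rows=5"
    # Use the current interpreter so pytest runs in the same environment.
    return [sys.executable, "-m", "pytest", TEST_FILE, flag]


# Quick run (not executed here):
#   import subprocess
#   subprocess.run(build_pytest_command(), check=False)
```

Because the multi-turn interactions are stochastic, a quick run over a few rows is probably best treated as a smoke test rather than a stable score.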