The Split
- Data plane (MCP): list_tools, call_tool, list_resources/read_resource.
  - Purpose: tool schemas and observations only (what the agent receives).
- Control plane (HTTP): /control/* endpoints with the mcp-session-id header.
  - Purpose: initial state, reward, status (terminated/truncated), and reset.
- Observations never come from control plane endpoints.
- Rewards/termination never come from tool results.
Sequence diagram (data vs control planes)
EP’s client enforces this separation in MCPConnectionManager and GeneralMCPVectorEnv.
Control Plane Endpoints
Servers should implement the following endpoints alongside their MCP transport (e.g., at https://your-server.example/mcp for MCP, and https://your-server.example/control/... for control):
- POST /control/reset_session
  - Headers: mcp-session-id: <session_id>
  - Body: { "seed": <int|null> }
  - Use: cleanup/reseed before a rollout or at close.
- GET /control/initial_state
  - Headers: mcp-session-id: <session_id>
  - Returns: JSON initial observation/state used to seed the first user prompt.
- GET /control/reward
  - Headers: mcp-session-id: <session_id>
  - Returns: { "reward": <float> } for the most recent step.
- GET /control/status
  - Headers: mcp-session-id: <session_id>
  - Returns: { "terminated": <bool>, "truncated": <bool> } to indicate episode end.
- EP generates a stable session_id by hashing dataset row values and the model ID via gen_session_id(...) and passes it in MCP clientInfo and as the control-plane header. Heads up: the hash does not include the run ID, so the MCP server needs to be restarted between runs. The current implementation of MCPGymRolloutProcessor() does this automatically.
- The simulator framework (SimulationServerBase) demonstrates session-aware design, but you still need to expose the /control/* endpoints in your production server. Note: the EP client does not depend on SimulationServerBase; it is provided as a reference pattern only.
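To make the restart requirement concrete, here is a minimal sketch of a deterministic session ID. The helper stable_session_id and its hashing scheme are assumptions for illustration; this is not EP's gen_session_id, whose exact inputs and hash are not shown here.

```python
import hashlib
import json
from typing import Any, Mapping


def stable_session_id(row: Mapping[str, Any], model_id: str) -> str:
    """Hypothetical helper: hash dataset row values plus the model ID.

    Not EP's gen_session_id; it only illustrates why the ID repeats across runs.
    """
    # sort_keys makes the serialization independent of key insertion order.
    payload = json.dumps(
        {"row": dict(row), "model_id": model_id}, sort_keys=True, default=str
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


row = {"id": "row-7", "seed": 42}
first = stable_session_id(row, "model-a")
# Same row values and model ID give the same session ID, with no run ID involved.
second = stable_session_id({"seed": 42, "id": "row-7"}, "model-a")
# A different model ID gives a different session ID.
other = stable_session_id(row, "model-b")
```

Because nothing run-specific enters the hash, a server that keeps per-session state would see the same key again on the next run, which is why stale state has to be cleared by a restart or a reset.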
End-to-End Flows
1) Initialization
- EP opens a streamable MCP session and sends clientInfo with session_id, seed, config, and model_id.
- EP pre-warms tool schemas via list_tools (data plane) and caches them.
- EP fetches initial state via GET /control/initial_state (control plane); if that times out or fails, it falls back to list_resources/read_resource (data plane) heuristics.
- The initial observation seeds the first user prompt with your user_prompt_template.
- Initial state is session-aware (derived from control plane when available).
- Tool schemas are cached per base_url to avoid thundering herds.
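The per-base_url caching idea can be sketched as follows. ToolSchemaCache and fake_list_tools are hypothetical names for illustration, not EP classes; the fetch callable stands in for an MCP list_tools call.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

ToolSchemas = List[Dict[str, Any]]


class ToolSchemaCache:
    """Caches tool schemas per base_url; concurrent callers share one fetch."""

    def __init__(self, fetch: Callable[[str], Awaitable[ToolSchemas]]) -> None:
        self._fetch = fetch  # hypothetical: wraps an MCP list_tools call
        self._cache: Dict[str, ToolSchemas] = {}
        self._locks: Dict[str, asyncio.Lock] = {}

    async def get(self, base_url: str) -> ToolSchemas:
        if base_url in self._cache:
            return self._cache[base_url]
        lock = self._locks.setdefault(base_url, asyncio.Lock())
        async with lock:
            # Re-check: another task may have filled the cache while we waited.
            if base_url not in self._cache:
                self._cache[base_url] = await self._fetch(base_url)
            return self._cache[base_url]


async def _demo() -> int:
    calls = 0

    async def fake_list_tools(base_url: str) -> ToolSchemas:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # simulate network latency
        return [{"name": "example_tool"}]

    cache = ToolSchemaCache(fake_list_tools)
    # Twenty concurrent sessions against the same server share a single fetch.
    await asyncio.gather(*(cache.get("http://localhost:8000/mcp") for _ in range(20)))
    return calls


fetch_count = asyncio.run(_demo())
```

The lock is keyed by base_url, so sessions on different servers do not block each other.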
2) Step Execution (per agent turn)
- Policy returns one or more MCP tool calls based on tool schemas and conversation history.
- EP executes the tool call via call_tool (data plane) and parses the observation from tool content.
- EP queries the control plane for reward and status:
  - GET /control/reward → scalar reward
  - GET /control/status → terminated/truncated
- EP attaches a control-plane step summary to the conversation for logging, including reward, termination, and tool calls.
- Observations never come from control plane endpoints.
- Rewards/termination never come from tool results.
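A single step under these rules might look like the sketch below. run_step, StepResult, and the two injected callables are hypothetical stand-ins for EP's internals; the fake tool and its payload are made up for the demo.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class StepResult:
    observation: Any
    reward: float
    terminated: bool
    truncated: bool
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)


def run_step(
    tool_call: Dict[str, Any],
    call_tool: Callable[[str, Dict[str, Any]], str],  # data plane (assumed shape)
    get_control: Callable[[str], Dict[str, Any]],  # control plane (assumed shape)
) -> StepResult:
    # Data plane: the observation comes only from the tool result.
    raw = call_tool(tool_call["name"], tool_call["arguments"])
    observation = json.loads(raw)
    # Control plane: reward and status come only from /control/*.
    reward = float(get_control("/control/reward")["reward"])
    status = get_control("/control/status")
    return StepResult(
        observation=observation,
        reward=reward,
        terminated=bool(status["terminated"]),
        truncated=bool(status["truncated"]),
        tool_calls=[tool_call],
    )


def fake_call_tool(name: str, arguments: Dict[str, Any]) -> str:
    return json.dumps({"position": 4, "action": arguments["action"]})


def fake_get_control(path: str) -> Dict[str, Any]:
    responses = {
        "/control/reward": {"reward": 1.0},
        "/control/status": {"terminated": True, "truncated": False},
    }
    return responses[path]


result = run_step(
    {"name": "move", "arguments": {"action": "RIGHT"}},
    fake_call_tool,
    fake_get_control,
)
```

The resulting StepResult carries the fields that a step summary would log: reward, termination flags, and the tool calls.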
3) Termination
An episode ends when any of the following occurs:
- Control plane status reports terminated (environment signaled end) or truncated (cutoff).
- The policy returns _no_tool_call or _playback_terminate (e.g., model finished or playback hit the end).
- The simulated user signals stop; EP maps this to termination_reason = user_stop.
TerminationReason values: stop, length, tool_calls, plus environment-driven control_plane_signal, max_steps, user_stop, error.
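As a rough illustration, the listed values can be modeled as an enum with a small resolver. ExampleTerminationReason and resolve_termination are hypothetical; in particular, the precedence order and the mapping of truncated to control_plane_signal are assumptions, not EP's documented behavior.

```python
from enum import Enum
from typing import Optional


class ExampleTerminationReason(str, Enum):
    """Illustrative mirror of the listed values; not EP's own class."""

    STOP = "stop"
    LENGTH = "length"
    TOOL_CALLS = "tool_calls"
    CONTROL_PLANE_SIGNAL = "control_plane_signal"
    MAX_STEPS = "max_steps"
    USER_STOP = "user_stop"
    ERROR = "error"


def resolve_termination(
    *,
    terminated: bool,
    truncated: bool,
    user_stopped: bool,
    step: int,
    max_steps: int,
) -> Optional[ExampleTerminationReason]:
    """Return a reason if the episode should end, else None.

    Precedence here (user stop, then control plane, then step budget)
    is an assumption made for this sketch.
    """
    if user_stopped:
        return ExampleTerminationReason.USER_STOP
    if terminated or truncated:
        return ExampleTerminationReason.CONTROL_PLANE_SIGNAL
    if step >= max_steps:
        return ExampleTerminationReason.MAX_STEPS
    return None


ended_by_env = resolve_termination(
    terminated=True, truncated=False, user_stopped=False, step=3, max_steps=10
)
still_running = resolve_termination(
    terminated=False, truncated=False, user_stopped=False, step=3, max_steps=10
)
```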
4) Failure Recovery
EP is defensive at the boundaries between planes:
- Initial state: If /control/initial_state fails or times out, EP falls back to read_resource (and ultimately a default observation) so rollouts can proceed.
- Tool responses: If a tool returns invalid/empty JSON, EP wraps it into a structured observation with an error tag instead of failing hard.
- Control queries: /control/reward and /control/status use short timeouts; absent data yields defaults (0.0 reward, not-terminated) and the step continues.
- Session re-init: Re-initialization closes any existing session handles and re-opens cleanly before retrying.
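The first three recovery rules can be sketched with plain functions. All names here, and the shape of the error-tagged observation, are assumptions for illustration rather than EP's actual structures.

```python
import json
from typing import Any, Callable, Dict, Optional, Sequence, Tuple


def initial_state_with_fallback(
    sources: Sequence[Callable[[], Optional[Dict[str, Any]]]],
    default: Dict[str, Any],
) -> Dict[str, Any]:
    """Try each source in order (control plane first, then read_resource)."""
    for source in sources:
        try:
            state = source()
        except Exception:
            continue  # timeout or failure: move on to the next source
        if state is not None:
            return state
    return default


def wrap_tool_output(raw: Any) -> Dict[str, Any]:
    """Turn invalid or empty JSON into a structured, error-tagged observation."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return {"observation": raw, "error": "invalid_json"}  # assumed shape
    return parsed if isinstance(parsed, dict) else {"observation": parsed}


def reward_and_status(
    fetch: Callable[[str], Dict[str, Any]],
) -> Tuple[float, bool, bool]:
    """Query the control plane; fall back to 0.0 reward and not-terminated."""
    try:
        reward = float(fetch("/control/reward")["reward"])
    except Exception:
        reward = 0.0
    try:
        status = fetch("/control/status")
        terminated, truncated = bool(status["terminated"]), bool(status["truncated"])
    except Exception:
        terminated, truncated = False, False
    return reward, terminated, truncated


def failing_control_plane() -> Dict[str, Any]:
    raise TimeoutError("control plane timed out")


def failing_fetch(path: str) -> Dict[str, Any]:
    raise TimeoutError(path)


state = initial_state_with_fallback(
    [failing_control_plane, lambda: {"position": 0}],
    default={"observation": "default"},
)
wrapped = wrap_tool_output("")
defaults = reward_and_status(failing_fetch)
```

In each case the rollout continues with a well-formed value instead of raising, which matches the intent of keeping failures at the plane boundaries from aborting a run.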
5) Cleanup
- At close, EP calls POST /control/reset_session and then closes the MCP transport.
Minimal Client Example
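A minimal sketch of a control-plane client, using only the Python standard library. ControlPlaneClient is a hypothetical class for illustration and is not EP's client; it covers only the four /control/* endpoints and leaves the MCP data plane out. The demo builds a request without sending it.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


class ControlPlaneClient:
    """Hypothetical stdlib client for the /control/* endpoints."""

    def __init__(self, base_url: str, session_id: str, timeout: float = 3.0) -> None:
        self.base_url = base_url.rstrip("/")
        self.session_id = session_id
        self.timeout = timeout

    def build_request(
        self, method: str, path: str, body: Optional[Dict[str, Any]] = None
    ) -> urllib.request.Request:
        data = json.dumps(body).encode("utf-8") if body is not None else None
        # Every control request carries the session ID header.
        headers = {"mcp-session-id": self.session_id}
        if data is not None:
            headers["Content-Type"] = "application/json"
        return urllib.request.Request(
            self.base_url + path, data=data, headers=headers, method=method
        )

    def send(
        self, method: str, path: str, body: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        request = self.build_request(method, path, body)
        with urllib.request.urlopen(request, timeout=self.timeout) as response:
            return json.loads(response.read().decode("utf-8"))

    def reset_session(self, seed: Optional[int] = None) -> Dict[str, Any]:
        return self.send("POST", "/control/reset_session", {"seed": seed})

    def initial_state(self) -> Dict[str, Any]:
        return self.send("GET", "/control/initial_state")

    def reward(self) -> float:
        return float(self.send("GET", "/control/reward")["reward"])

    def status(self) -> Dict[str, Any]:
        return self.send("GET", "/control/status")


client = ControlPlaneClient("https://your-server.example", "session-123")
# Build (but do not send) a reset request to inspect what goes on the wire.
reset_request = client.build_request("POST", "/control/reset_session", {"seed": 7})
```

Against a running server, a rollout would call reset_session, then initial_state, then reward and status after each tool call, and reset_session again at close.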
Multi-Server Aggregation (Optional)
If you need to aggregate tools from multiple MCP servers, EP provides MCPMultiClient, which connects to both stdio and remote servers and exposes all tools under one client.
/control/* endpoints.
Record/Playback
Set EP_PLAYBACK_FILE to enable deterministic record/playback. During playback, the policy is stepped to match prior turns, and _playback_terminate ends the episode at the recorded boundary. Control-plane step summaries and an optional OpenAI-format log are emitted for terminated trajectories.
Server Implementation Checklist
Use this as a reference when building the control plane alongside your MCP server.
- Headers: include mcp-session-id on every control request; return Content-Type: application/json.
- Session ID: treat as opaque but stable per dataset row + model; do not coalesce across different seeds/config.
- Idempotency: make POST /control/reset_session safe to call multiple times; ignore duplicate resets.
- Initialization:
  - GET /control/initial_state returns the initial observation JSON for this session, derived from seed and config (from MCP clientInfo).
  - Keep this response free of reward/termination fields; it seeds the first user prompt only.
- Step reporting:
  - GET /control/reward returns { "reward": <float> } for the most recent applied action.
  - GET /control/status returns { "terminated": <bool>, "truncated": <bool> } for the episode state.
  - Do not include observation content here; that stays in the data plane.
- Timeouts and SLAs:
  - EP uses ~15s timeout for initial_state under high concurrency (3s in playback) and ~3s for reward/status.
  - Aim for sub-1s responses; if computation is heavy, cache per session_id.
- Errors:
  - Use 4xx for client mistakes (missing/invalid mcp-session-id), 5xx for server errors.
  - On faults, respond with a minimal JSON error body; EP will default to reward=0.0 and terminated=false on non-200s.
- Concurrency:
  - Expect many concurrent sessions; isolate per session_id and avoid global mutable state.
  - Ensure tool results (data plane) and control updates are applied atomically in your environment loop.
- Security:
  - You may authenticate control endpoints; keep auth orthogonal to mcp-session-id routing.
  - Validate reasonable session_id lengths to prevent abuse.
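Several checklist items can be sketched as a framework-agnostic handler that maps a request to a status code and JSON body. ControlPlaneState, MAX_SESSION_ID_LEN, the reset response body, and the placeholder initial state are all assumptions for illustration; a real server would wire handle into its HTTP framework and derive the initial observation from seed and config.

```python
import threading
from typing import Any, Dict, Optional, Tuple

MAX_SESSION_ID_LEN = 256  # assumed limit, not an EP constant


class ControlPlaneState:
    """Per-session control-plane state with a transport-independent handler."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._sessions: Dict[str, Dict[str, Any]] = {}

    @staticmethod
    def _fresh(seed: Optional[int]) -> Dict[str, Any]:
        # Placeholder initial state; a real environment builds an observation.
        return {
            "initial_state": {"seed": seed},
            "reward": 0.0,
            "terminated": False,
            "truncated": False,
        }

    def record_step(
        self, session_id: str, reward: float, terminated: bool, truncated: bool = False
    ) -> None:
        """Called by the environment loop; updates all step fields under one lock."""
        with self._lock:
            state = self._sessions.setdefault(session_id, self._fresh(None))
            state.update(reward=float(reward), terminated=terminated, truncated=truncated)

    def handle(
        self,
        method: str,
        path: str,
        headers: Dict[str, str],  # assumed to use lowercase header names
        body: Optional[Dict[str, Any]] = None,
    ) -> Tuple[int, Dict[str, Any]]:
        session_id = headers.get("mcp-session-id", "")
        if not session_id or len(session_id) > MAX_SESSION_ID_LEN:
            return 400, {"error": "missing or invalid mcp-session-id"}
        with self._lock:
            if (method, path) == ("POST", "/control/reset_session"):
                # Idempotent: repeated resets simply rebuild the same fresh state.
                self._sessions[session_id] = self._fresh((body or {}).get("seed"))
                return 200, {"ok": True}
            if method != "GET":
                return 405, {"error": "method not allowed"}
            state = self._sessions.setdefault(session_id, self._fresh(None))
            if path == "/control/initial_state":
                return 200, dict(state["initial_state"])
            if path == "/control/reward":
                return 200, {"reward": state["reward"]}
            if path == "/control/status":
                return 200, {
                    "terminated": state["terminated"],
                    "truncated": state["truncated"],
                }
        return 404, {"error": "unknown control endpoint"}


control = ControlPlaneState()
session_headers = {"mcp-session-id": "session-123"}
control.handle("POST", "/control/reset_session", session_headers, {"seed": 7})
control.record_step("session-123", reward=1.0, terminated=True)
reward_response = control.handle("GET", "/control/reward", session_headers)
```

State lives on the instance and is keyed by session ID, so concurrent sessions stay isolated, and the reward and status responses never include observation content.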
Reading clientInfo on the Server
Servers using the low-level MCP server can extract clientInfo extras to create stable, session-aware environments. Example:
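A minimal sketch, assuming the clientInfo extras have already been pulled out of the MCP request context into a plain mapping. How to reach them depends on the server library and is not shown here. SessionConfig, parse_client_info_extras, and get_or_create_session are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Optional


@dataclass
class SessionConfig:
    session_id: str
    seed: Optional[int] = None
    config: Dict[str, Any] = field(default_factory=dict)
    model_id: Optional[str] = None


def parse_client_info_extras(extras: Mapping[str, Any]) -> SessionConfig:
    """Validate the extras EP sends: session_id, seed, config, model_id."""
    session_id = extras.get("session_id")
    if not isinstance(session_id, str) or not session_id:
        raise ValueError("clientInfo is missing session_id")
    return SessionConfig(
        session_id=session_id,
        seed=extras.get("seed"),
        config=dict(extras.get("config") or {}),
        model_id=extras.get("model_id"),
    )


class SessionRegistry:
    """Keeps one SessionConfig per session_id (no module-level globals)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, SessionConfig] = {}

    def get_or_create_session(self, extras: Mapping[str, Any]) -> SessionConfig:
        parsed = parse_client_info_extras(extras)
        # The first clientInfo seen for a session_id wins; later ones reuse it.
        return self._sessions.setdefault(parsed.session_id, parsed)


registry = SessionRegistry()
session = registry.get_or_create_session(
    {"session_id": "session-123", "seed": 42, "config": {"size": 4}, "model_id": "model-a"}
)
same_session = registry.get_or_create_session({"session_id": "session-123"})
```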
- Use session_id as the key for per-session state. Seed and config should shape the initial state.
- Keep observations on the data plane; publish reward and termination via /control/*.
GitHub References
- Client: MCP connection manager (control/data split)
- Client: Vector env/session manager
- Server: MCP-Gym base with control-plane endpoints
- Server: Simulation server base (session-aware patterns)
- Example servers implementing McpGym
- Frozen Lake: https://github.com/eval-protocol/python-sdk/blob/main/examples/frozen_lake_mcp/frozen_lake_mcp.py
- Lunar Lander: https://github.com/eval-protocol/python-sdk/blob/main/examples/lunar_lander_mcp/lunar_lander_mcp.py
- Cliff Walking: https://github.com/eval-protocol/python-sdk/blob/main/examples/cliff_walking_mcp/cliff_walking_mcp.py
- Blackjack: https://github.com/eval-protocol/python-sdk/blob/main/examples/blackjack_mcp/blackjack_mcp.py
- Tau2 domains: https://github.com/eval-protocol/python-sdk/blob/main/examples/tau2_mcp/tau2_mcp.py