- **Eval**: a test defined with the `@evaluation_test` decorator in the Python SDK.
- **Run**: when `num_runs > 1`, each repetition has a unique `run_id`.
- **Rollout**: the trajectory for a single row. Each rollout has a unique `rollout_id`.
- **Trajectory**: the `messages` (and optional tool calls) produced during a rollout.
- **Row**: contains the `messages`, optional `ground_truth`, and the evaluator's `evaluation_result`.
- **Dataset**: a collection of `EvaluationRow`s.
- **Evaluator**: the `@evaluation_test`-decorated function. It computes a score in [0, 1] and writes it to the row's `evaluation_result` (see the sketch after this list).
- **Message**: `content` supports either a string or OpenAI content parts.
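To make the evaluator contract concrete, here is a minimal sketch. It assumes the SDK's `Message`, `EvaluationRow`, and `EvaluateResult` models and that the decorator accepts `input_messages` and `completion_params` as shown; import paths and exact argument names may differ across SDK versions.

```python
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import evaluation_test  # assumed import path


@evaluation_test(
    input_messages=[[Message(role="user", content="What is 2 + 2?")]],
    completion_params=[{"model": "gpt-4o"}],
)
def test_contains_answer(row: EvaluationRow) -> EvaluationRow:
    # After the rollout, the last message holds the model's completion.
    answer = row.messages[-1].content or ""
    score = 1.0 if "4" in str(answer) else 0.0
    # The evaluator writes a score in [0, 1] to the row's evaluation_result.
    row.evaluation_result = EvaluateResult(score=score)
    return row
```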
`EvaluateResult` represents the complete result of an evaluator, providing an overall score and component metrics. Key features include:

- component metrics as `MetricResult` objects
- `step_outputs` for reinforcement learning
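A sketch of constructing one, assuming `EvaluateResult` and `MetricResult` are pydantic models with `score` and `reason` fields (the field names are an assumption; check the SDK):

```python
from eval_protocol.models import EvaluateResult, MetricResult

result = EvaluateResult(
    score=0.75,  # overall score in [0, 1]
    reason="3 of 4 checks passed",
    metrics={
        "format": MetricResult(score=1.0, reason="response was valid JSON"),
        "accuracy": MetricResult(score=0.5, reason="partially correct answer"),
    },
)
```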
`EvaluationRow` is the canonical JSON-serializable unit of data used for both single-turn and trajectory evaluations. It contains the conversation, tool context, evaluation results, and metadata needed for reproducibility and analysis. Notably:

- `rollout_status` captures running/finished/error
- `input_metadata`, seeds, and identifiers support traceability
A `Dataset` is a list of `EvaluationRow`s. When saved to file, it is a JSONL file where each line is a JSON-encoded `EvaluationRow`.
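For example, a minimal sketch of building rows and saving them as JSONL, assuming `EvaluationRow` is a pydantic model (so `model_dump_json()` is available); the file name is arbitrary:

```python
from eval_protocol.models import EvaluationRow, Message

rows = [
    EvaluationRow(
        messages=[Message(role="user", content="What is the capital of France?")],
        ground_truth="Paris",
    ),
]

with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(row.model_dump_json() + "\n")  # one JSON-encoded EvaluationRow per line
```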
`EvaluationTest` represents a test configuration for evaluating models.
While not explicitly defined as a separate class in the current implementation,
evaluation tests are configured through the `evaluation_test` decorator. The decorator
can be used to configure the following (see the example after this list):

- Input data (`input_messages`)
- Model (`model`) and generation settings via `completion_params`
- Pass threshold (`passed_threshold`), with optional standard deviation constraint
- Rollout processor (`default_single_turn_rollout_processor`)
- Number of runs (`num_runs=1`)
- Evaluation mode (`pointwise` or `batch`)
- Aggregation method (`mean`) and optional env overrides for summaries
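A sketch of a more fully configured test. Parameter names follow the list above, but exact names and accepted values may vary between SDK versions; `mode`, `aggregation_method`, and the import of `default_single_turn_rollout_processor` in particular are assumptions here.

```python
from eval_protocol.models import EvaluationRow, Message
from eval_protocol.pytest import default_single_turn_rollout_processor, evaluation_test


@evaluation_test(
    input_messages=[[Message(role="user", content="Summarize this ticket.")]],
    completion_params=[{"model": "gpt-4o", "temperature": 0.0}],
    passed_threshold=0.8,                                   # fail the test below this mean score
    rollout_processor=default_single_turn_rollout_processor,
    num_runs=3,                                             # repeat rollouts; each gets a run_id
    mode="pointwise",                                       # or "batch"
    aggregation_method="mean",
)
def test_summary_quality(row: EvaluationRow) -> EvaluationRow:
    ...
```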
`McpGym` is the base class for building environments that an LLM can interact with via MCP tool calls (the data plane) while exposing rewards and episode status via HTTP control-plane endpoints. This enables reproducible RL-style rollouts with a clean separation of concerns.
Key concepts:
- `session_id` keys route control-plane queries to the right episode
- `control_plane_endpoint(path)`: Decorator to register a session-aware endpoint
- `_register_tools()`: Register domain tools with `self.mcp.tool()`
- `format_observation(obs, env) -> Dict[str, Any]`: Return JSON-serializable observation payloads
- `run(transport="streamable-http")`: Start the FastMCP server with high-concurrency settings
- Control-plane endpoints: `/control/reward`, `/control/status`, `/control/info`, `/control/initial_state`

See `python-sdk/eval_protocol/mcp/mcpgym.py` for the full implementation, including the `control_plane_endpoint` decorator and session handling.
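To illustrate the shape of a subclass, here is a rough, hypothetical sketch (not the repository's actual example): the class name, the `move` tool, and the constructor call are illustrative, and the real base class may require additional arguments and hooks, as noted in `mcpgym.py`.

```python
from typing import Any, Dict

from eval_protocol.mcp.mcpgym import McpGym  # assumed import path


class GridWorldGym(McpGym):  # hypothetical environment
    def _register_tools(self) -> None:
        # Domain tools (the data plane) are registered with self.mcp.tool().
        @self.mcp.tool(name="move", description="Move the agent one cell: up/down/left/right.")
        def move(direction: str) -> Dict[str, Any]:
            # A real subclass would step the session's environment here and
            # return the formatted observation; elided in this sketch.
            ...

    def format_observation(self, obs: Any, env: Any) -> Dict[str, Any]:
        # Return only JSON-serializable observation data; rewards and episode
        # status are reported via the /control/* endpoints, not tool responses.
        return {"position": obs}


if __name__ == "__main__":
    # Constructor arguments (e.g., an environment adapter, seed) omitted for brevity.
    GridWorldGym().run(transport="streamable-http")
```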
The `EnvironmentAdapter` class provides the interface for connecting environments to the MCP framework. It defines the following methods (a sketch follows the list):

- `create_environment()`: Create and return a new environment instance
- `create_environment_with_seed()`: Create an environment with a specific seed for reproducibility
- `reset_environment()`: Reset the environment to its initial state
- `step_environment()`: Execute one step in the environment
- `close_environment()`: Clean up environment resources
- `parse_action()`: Parse an action string into the environment-specific format
- `format_observation()`: Format an observation for MCP transmission
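A rough sketch of an adapter around a Gymnasium environment. The method names follow the list above, but the argument shapes and the import path are assumptions, not the exact signatures.

```python
from typing import Any, Dict, Optional, Tuple

import gymnasium as gym

from eval_protocol.mcp import EnvironmentAdapter  # assumed import path


class FrozenLakeAdapter(EnvironmentAdapter):
    ACTIONS = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}

    def create_environment(self, config: Optional[Dict[str, Any]] = None) -> gym.Env:
        return gym.make("FrozenLake-v1", is_slippery=False)

    def create_environment_with_seed(self, seed: int, config: Optional[Dict[str, Any]] = None) -> gym.Env:
        env = self.create_environment(config)
        env.reset(seed=seed)  # seed the initial reset for reproducibility
        return env

    def reset_environment(self, env: gym.Env) -> Any:
        obs, _info = env.reset()
        return obs

    def step_environment(self, env: gym.Env, action: int) -> Tuple[Any, float, bool, bool, Dict[str, Any]]:
        return env.step(action)

    def close_environment(self, env: gym.Env) -> None:
        env.close()

    def parse_action(self, action: str) -> int:
        # Map the model's tool-call argument onto the discrete action space.
        return self.ACTIONS[action.upper()]

    def format_observation(self, obs: Any) -> Dict[str, Any]:
        return {"position": int(obs)}
```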
In the simplest case, a policy is just a model such as `gpt-4o` or `llama-3.1-8b`. In more advanced scenarios, a policy can be your own custom fine-tuned model.
The `LiteLLMPolicy` class provides a unified implementation that works with any MCP environment via tool calling. Provider-specific implementations are also available:

- `OpenAIPolicy`: OpenAI-specific policy implementation
- `AnthropicPolicy`: Anthropic Claude-specific policy implementation
- `FireworksPolicy`: Fireworks AI-specific policy implementation
- `LocalPolicy`: Local model policy implementation
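A hypothetical instantiation of the unified policy; the import path and constructor arguments (`model_id`, `temperature`, `max_tokens`) are assumptions and may not match your SDK version.

```python
from eval_protocol.mcp import LiteLLMPolicy  # assumed import path

# Any model LiteLLM can route to (OpenAI, Anthropic, Fireworks, local, ...) can back the policy.
policy = LiteLLMPolicy(
    model_id="gpt-4o",
    temperature=0.2,
    max_tokens=1024,
)
```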