Core Execution Concepts
The following concepts define the lifecycle and data units of an evaluation. They match the semantics used by the `@evaluation_test` decorator in the Python SDK.
invocation
A single execution of a test function. One invocation can generate one or more experiments.
experiment
A group of runs for a specific combination of parameters (e.g., model x dataset x generation params). Each new execution of the test function produces a new experiment.
run
A group of rollouts produced when repeating the same experiment multiple times. When `num_runs > 1`, each repetition has a unique `run_id`. For example, `num_runs=3` over a 10-row dataset yields one experiment with 3 runs and 30 rollouts (one per row per run).
rollout
The process that produces a `trajectory` for a single row. Each rollout has a unique `rollout_id`.
trajectory
The sequence of chat `messages` (and optional tool calls) produced during a rollout.
row
The atomic evaluation unit. A row contains the conversation `messages`, optional `ground_truth`, and the evaluator's `evaluation_result`. Every row is uniquely identified by its `row_id`. If not provided by the dataset, a stable hash is generated from the row's content.
dataset
A collection (list) of rows. When stored, it is a JSONL file where each line is an `EvaluationRow`.
eval
The rubric implemented in the body of an `@evaluation_test`-decorated function. It computes a `score` in [0, 1] and writes it to the row's `evaluation_result`.
Foundational Types
JSONType
Message
Represents a chat message with trajectory evaluation support. `content` supports either a string or OpenAI content parts.
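A minimal sketch of the two `content` shapes. The `Message(role=..., content=...)` constructor and import path are assumptions; the content-part layout follows the OpenAI chat format.

```python
# Illustrative only: assumes Message(role=..., content=...) mirrors the
# OpenAI chat message shape described above; import path is an assumption.
from eval_protocol import Message

# content as a plain string
text_msg = Message(role="user", content="What is the capital of France?")

# content as OpenAI-style content parts (e.g., mixed text and image input)
parts_msg = Message(
    role="user",
    content=[
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ],
)
```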
CompletionParams
InputMetadata
ErrorInfo (AIP-193)
Structured error detail used inside `Status.details`, per Google's AIP-193.
Status (AIP-193)
TerminationReason
MetricResult
Result of a single metric evaluation.
StepOutput
Defines the base reward and other metrics for a single conceptual step within a rollout.
EvaluationThreshold
EvalMetadata
CostMetrics
ExecutionMetadata
EvaluateResult
The `EvaluateResult` represents the complete result of an evaluator, providing an overall score and component metrics.
- Unified Model: Serves both per-turn and per-trajectory evaluation scenarios
- Component Metrics: Detailed breakdown through `MetricResult` objects
- RL Support: Per-step base rewards via `step_outputs` for reinforcement learning
- Error Handling: Graceful error reporting and validation
- Trajectory Info: Additional metadata for trajectory-based evaluations
- Aggregation: Optional `agg_score` and `standard_error` for multi-run summaries
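A hedged sketch of constructing an `EvaluateResult` with component metrics. The `metrics` field name and the `MetricResult` keyword arguments shown here are assumptions based on the fields described above, not a verified signature.

```python
# Hedged sketch: only score, step_outputs, agg_score, and standard_error are
# named in the docs above; the other field names below are assumptions.
from eval_protocol import EvaluateResult, MetricResult  # import path assumed

result = EvaluateResult(
    score=0.75,  # overall score in [0, 1]
    metrics={    # component breakdown via MetricResult objects ("metrics" name assumed)
        "format": MetricResult(score=1.0, reason="Answer followed the requested format"),
        "accuracy": MetricResult(score=0.5, reason="One of two facts was correct"),
    },
)
```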
EvaluationRow
The `EvaluationRow` is the canonical JSON-serializable unit of data used for both single-turn and trajectory evaluations. It contains the conversation, tool context, evaluation results, and metadata needed for reproducibility and analysis.
- Unified Format: Canonical row format for both pointwise and trajectory evaluations
- Explicit Status: `rollout_status` captures running/finished/error
- Reproducibility: `input_metadata`, seeds, and identifiers support traceability
- Usage Tracking: Captures token usage statistics from LLM calls
Dataset
A list of `EvaluationRow`s. When saved to file, it is a JSONL file where each line is a JSON-encoded `EvaluationRow`.
JSONL example
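A hedged illustration of the format: each line of the file is one JSON object. The dict layout below is an assumption based on the row description above (`messages`, `ground_truth`, `evaluation_result`), not the exact `EvaluationRow` schema.

```python
# Illustrative sketch: the row layout is assumed from the description above,
# not the exact EvaluationRow schema.
import json

rows = [
    {
        "row_id": "example-0001",
        "messages": [
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "4"},
        ],
        "ground_truth": "4",
        "evaluation_result": {"score": 1.0},
    },
]

# One JSON-encoded row per line.
with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```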
EvaluationTest
The `EvaluationTest` represents a test configuration for evaluating models. While not explicitly defined as a separate class in the current implementation, evaluation tests are configured through the `evaluation_test` decorator. The decorator can be used to configure the following (see the sketch after this list):
- Dataset Configuration: JSONL files containing test cases or hard-coded `input_messages`
- Model Configuration: Completion parameters (must include `model`) and generation settings via `completion_params`
- Evaluation Criteria: Success thresholds (via `passed_threshold`), with an optional standard deviation constraint
- Environment Configuration: MCP config, rollout steps, server path, and concurrency
- Rollout Processor: Class to execute rollouts (e.g., `SingleTurnRolloutProcessor()`)
- Number of Runs: Number of times to repeat the rollout (e.g., `num_runs=1`)
- Mode: Evaluation mode (`pointwise`, `groupwise`, or `all`)
- Aggregation: Aggregation method (e.g., `mean`) and optional env overrides for summaries
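A minimal sketch of a decorated test. The parameters named in the list above (`completion_params`, `passed_threshold`, `num_runs`) are used as-is; the `input_dataset` and `rollout_processor` parameter names, the import paths, and the exact function signature are assumptions.

```python
# Hedged sketch: import paths, input_dataset/rollout_processor parameter names,
# and the row-in/row-out function signature are assumptions.
from eval_protocol import EvaluateResult, EvaluationRow
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test

@evaluation_test(
    input_dataset=["tests/data/math.jsonl"],                    # parameter name assumed
    completion_params=[{"model": "gpt-4o", "temperature": 0.0}],  # must include model
    rollout_processor=SingleTurnRolloutProcessor(),
    passed_threshold=0.8,
    num_runs=1,
    mode="pointwise",
)
def test_math_answer(row: EvaluationRow) -> EvaluationRow:
    # The rubric: score 1.0 if the assistant's final answer contains ground_truth.
    answer = row.messages[-1].content or ""
    score = 1.0 if row.ground_truth and row.ground_truth in answer else 0.0
    row.evaluation_result = EvaluateResult(score=score)
    return row
```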
MCP Gym
`McpGym` is the base class for building environments that an LLM can interact with via MCP tool calls (data plane) while exposing rewards and episode status via HTTP control-plane endpoints. This enables reproducible RL-style rollouts with clean separation of concerns.
Key concepts:
- Data plane: Tool calls and JSON responses used by the model to act and observe state
- Control plane: Session-scoped endpoints for rewards, termination, and info
- Multi-session: Stable `session_id` keys route control-plane queries to the right episode
- `control_plane_endpoint(path)`: Decorator to register a session-aware endpoint
- `_register_tools()`: Register domain tools with `self.mcp.tool()`
- `format_observation(obs, env) -> Dict[str, Any]`: Return JSON-serializable observation payloads
- `run(transport="streamable-http")`: Start the FastMCP server with high-concurrency settings
- Standard control-plane endpoints on subclasses: `/control/reward`, `/control/status`, `/control/info`, `/control/initial_state`
See `python-sdk/eval_protocol/mcp/mcpgym.py` for the full implementation, including the `control_plane_endpoint` decorator and session handling.
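A minimal subclass sketch using only the hooks listed above. The import path, the `self.env` attribute, the endpoint method signature, and the payload shapes are assumptions; the wrapped environment logic is a placeholder.

```python
# Hedged sketch of an McpGym subclass: hook names come from the list above;
# imports, self.env, and payload shapes are assumptions.
from typing import Any, Dict

from eval_protocol.mcp.mcpgym import McpGym, control_plane_endpoint  # path assumed

class GridWorldGym(McpGym):
    def _register_tools(self) -> None:
        # Data plane: the model acts and observes state through this tool.
        @self.mcp.tool()
        def move(direction: str) -> Dict[str, Any]:
            """Move the agent one cell and return the new observation."""
            obs, reward, done = self.env.step(direction)  # self.env is assumed
            self._last_reward, self._done = reward, done
            return self.format_observation(obs, self.env)

    def format_observation(self, obs: Any, env: Any) -> Dict[str, Any]:
        # Keep the payload JSON-serializable; rewards stay on the control plane.
        return {"position": obs}

    # Control plane: queried per session, never shown to the model.
    @control_plane_endpoint("/control/reward")
    def reward_endpoint(self, session_id: str) -> Dict[str, Any]:
        return {"reward": getattr(self, "_last_reward", 0.0)}

if __name__ == "__main__":
    GridWorldGym().run(transport="streamable-http")
```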
Environment
The `EnvironmentAdapter` class provides the interface for connecting environments to the MCP framework.
- Default Implementations: Works with most gymnasium-style and complex environments
- Flexible Configuration: Supports custom configuration dictionaries
- Seed Support: Reproducible environments through seed-based initialization
- Clean Interface: Separates MCP protocol layer from environment implementation
- `create_environment()`: Create and return a new environment instance
- `create_environment_with_seed()`: Create an environment with a specific seed for reproducibility
- `reset_environment()`: Reset the environment to its initial state
- `step_environment()`: Execute one step in the environment
- `close_environment()`: Clean up environment resources
- `parse_action()`: Parse an action string into the environment-specific format
- `format_observation()`: Format an observation for MCP transmission
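A hedged sketch of an adapter around a gymnasium environment. The method names are taken from the list above, but the argument and return conventions, as well as the base-class import path, are assumptions.

```python
# Hedged sketch: method names follow the list above; signatures, return shapes,
# and the EnvironmentAdapter import path are assumptions.
from typing import Any, Dict, Optional, Tuple

import gymnasium as gym

from eval_protocol.mcp import EnvironmentAdapter  # import path assumed

class CartPoleAdapter(EnvironmentAdapter):
    def create_environment(self, config: Optional[Dict[str, Any]] = None) -> gym.Env:
        return gym.make("CartPole-v1")

    def create_environment_with_seed(self, seed: int, config=None) -> gym.Env:
        env = self.create_environment(config)
        env.reset(seed=seed)  # seed the first reset for reproducibility
        return env

    def reset_environment(self, env: gym.Env) -> Any:
        obs, _info = env.reset()
        return obs

    def step_environment(self, env: gym.Env, action: Any) -> Tuple[Any, float, bool]:
        obs, reward, terminated, truncated, _info = env.step(action)
        return obs, float(reward), terminated or truncated

    def close_environment(self, env: gym.Env) -> None:
        env.close()

    def parse_action(self, action_str: str) -> int:
        # CartPole actions are discrete: "0" (left) or "1" (right).
        return int(action_str)

    def format_observation(self, obs: Any) -> Dict[str, Any]:
        # Keep the payload JSON-serializable for MCP transmission.
        return {"observation": [float(x) for x in obs]}
```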
Policy
A policy is a model such as `gpt-4o` or `llama-3.1-8b`. In more advanced scenarios, a policy can be your own custom fine-tuned model.
The `LiteLLMPolicy` class provides a unified implementation that works with any MCP environment via tool calling:
- Provider Agnostic: Supports OpenAI, Anthropic, Fireworks AI, and other providers
- Built-in Caching: Multiple cache types (memory, Redis, dual, S3, disk)
- Retry Logic: Robust retry strategies with exponential backoff
- Tool Calling: Native support for MCP tool calling
- Environment Agnostic: No environment-specific logic - everything from MCP tools
- `OpenAIPolicy`: OpenAI-specific policy implementation
- `AnthropicPolicy`: Anthropic Claude-specific policy implementation
- `FireworksPolicy`: Fireworks AI-specific policy implementation
- `LocalPolicy`: Local model policy implementation
- Multi-Tool Support: Handle multiple tool calls per turn
- Conversation History: Maintain context across interactions
- Error Handling: Graceful handling of API failures and retries
- Caching: Response caching for improved performance and cost reduction
- Logging: Comprehensive logging for debugging and analysis
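A brief, hedged usage sketch. The import path and the constructor arguments shown (`model_id`, `temperature`) are assumptions rather than the documented `LiteLLMPolicy` signature.

```python
# Hedged sketch: import path and constructor arguments are assumptions.
from eval_protocol.mcp import LiteLLMPolicy

# Any LiteLLM-style model identifier should work, since the policy is
# provider agnostic (OpenAI, Anthropic, Fireworks AI, ...).
policy = LiteLLMPolicy(
    model_id="fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct",
    temperature=0.0,
)
```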