Core Execution Concepts

The following concepts define the lifecycle and data units of an evaluation. These match the semantics used by the @evaluation_test decorator in the Python SDK.

invocation

A single execution of a test function. One invocation can generate one or more experiments.

experiment

A group of runs for a specific combination of parameters (e.g., model × dataset × generation params). Each new execution of the test function produces a new experiment.

run

A group of rollouts produced when repeating the same experiment multiple times. When num_runs > 1, each repetition has a unique run_id.

rollout

The process that produces a trajectory for a single row. Each rollout has a unique rollout_id.

trajectory

The sequence of chat messages (and optional tool calls) produced during a rollout.

row

The atomic evaluation unit. A row contains the conversation messages, optional ground_truth, and the evaluator’s evaluation_result.

dataset

A collection (list) of rows. When stored, it is a JSONL file where each line is an EvaluationRow.

eval

The rubric implemented in the body of an @evaluation_test-decorated function. It computes a score in [0, 1] and writes it to the row’s evaluation_result.

Foundational Types

Message

Represents a chat message, with support for trajectory evaluation. The content field accepts either a plain string or a list of OpenAI-style content parts.
class ChatCompletionContentPartTextParam(BaseModel):
    text: str
    type: Literal["text"] = "text"

class Message(BaseModel):
    role: str  # assistant, user, system, tool
    content: Optional[Union[str, List[ChatCompletionContentPartTextParam]]] = ""
    name: Optional[str] = None
    tool_call_id: Optional[str] = None
    tool_calls: Optional[List[ChatCompletionMessageToolCall]] = None
    function_call: Optional[FunctionCall] = None
    control_plane_step: Optional[Dict[str, Any]] = None
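For illustration, both content forms can be constructed directly. A minimal sketch, assuming the models are importable from eval_protocol (adjust the import path to your installation):
from eval_protocol.models import Message, ChatCompletionContentPartTextParam  # import path assumed

# Plain-string content
user_msg = Message(role="user", content="Add 2 and 3.")

# OpenAI-style content parts
parts_msg = Message(
    role="user",
    content=[ChatCompletionContentPartTextParam(text="Add 2 and 3.")],
)

assert isinstance(user_msg.content, str)
assert isinstance(parts_msg.content, list)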

CompletionParams

CompletionParams = Dict[str, Any]
"""
Provider-agnostic completion parameters.

Required:
- model: str

Common fields:
- temperature: Optional[float]
- max_tokens: Optional[int]
- top_p: Optional[float]

Extra provider-specific fields are allowed and passed through (e.g., max_tool_calls).
"""

InputMetadata

class InputMetadata(BaseModel):
    # Accepts additional keys for future extensibility
    # (model_config = ConfigDict(extra="allow") in implementation)

    row_id: Optional[str]  # defaulted to a generated ID
    completion_params: CompletionParams = Field(default_factory=dict)
    dataset_info: Optional[Dict[str, Any]]  # seed, system_prompt, environment_context, etc.
    session_data: Optional[Dict[str, Any]]

RolloutStatus

class RolloutStatus(BaseModel):
    status: Literal["running","finished","error"] = "running"
    termination_reason: Optional[str]

MetricResult

Result of a single metric evaluation:
class MetricResult(BaseModel):
    is_score_valid: bool = True
    score: float  # Between 0.0 and 1.0
    reason: str  # Explanation for the score

StepOutput

Defines the base reward and other metrics for a single conceptual step within a rollout:
class StepOutput(BaseModel):
    step_index: Union[int, str]  # User-defined index for the step
    base_reward: float  # Base reward calculated by the user's reward function
    terminated: bool = False  # Whether the environment signaled termination
    control_plane_info: Optional[Dict[str, Any]]  # Structured info from environment
    metrics: Dict[str, Any] = Field(default_factory=dict)  # Optional custom metrics
    reason: Optional[str]  # Optional explanation for the step's base reward
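An illustrative list of per-step outputs for a short rollout, assuming the Optional fields default to None (values are examples only):
step_outputs = [
    StepOutput(step_index=0, base_reward=0.0, reason="no progress toward the goal"),
    StepOutput(step_index=1, base_reward=0.5, reason="reached intermediate subgoal"),
    StepOutput(step_index=2, base_reward=1.0, terminated=True, reason="task solved"),
]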

EvaluationThreshold

class EvaluationThreshold(BaseModel):
    success: float  # Minimum success rate threshold (0.0 to 1.0)
    standard_deviation: Optional[float]  # Optional maximum stddev threshold

EvalMetadata

class EvalMetadata(BaseModel):
    name: str
    description: Optional[str]
    version: str  # PEP 440 version string (auto-populated)
    status: Optional[Literal["running","finished","error","stopped"]]
    num_runs: int
    aggregation_method: str
    passed_threshold: Optional[EvaluationThreshold]
    passed: Optional[bool]

ExecutionMetadata

class ExecutionMetadata(BaseModel):
    invocation_id: Optional[str]
    experiment_id: Optional[str]
    rollout_id: Optional[str]
    run_id: Optional[str]
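These identifiers mirror the execution concepts above: one invocation_id per execution of the test function, one experiment_id per parameter combination, one run_id per repetition, and one rollout_id per rollout. A hedged sketch of grouping rows by experiment and run, assuming rows is a list of EvaluationRow (defined below):
from collections import defaultdict

by_experiment_run = defaultdict(list)
for row in rows:
    meta = row.execution_metadata
    by_experiment_run[(meta.experiment_id, meta.run_id)].append(row)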

EvaluateResult

The EvaluateResult represents the complete result of an evaluator, providing an overall score and component metrics.
class EvaluateResult(BaseModel):
    # Core evaluation data
    score: float  # Overall evaluation score (0.0 to 1.0)
    is_score_valid: bool  # Whether the overall score is valid
    reason: Optional[str]  # Optional explanation for the overall score
    
    # Component metrics
    metrics: Dict[str, MetricResult]  # Dictionary of component metrics
    
    # RL-specific fields
    step_outputs: Optional[List[StepOutput]]  # Per-step base rewards for RL
    
    # Error handling
    error: Optional[str]  # Optional error message if evaluation failed
    
    # Trajectory information
    trajectory_info: Optional[Dict[str, Any]]  # Additional trajectory-level information
    final_control_plane_info: Optional[Dict[str, Any]]  # Final control plane state
Key Features:
  • Unified Model: Serves both per-turn and per-trajectory evaluation scenarios
  • Component Metrics: Detailed breakdown through MetricResult objects
  • RL Support: Per-step base rewards via step_outputs for reinforcement learning
  • Error Handling: Graceful error reporting and validation
  • Trajectory Info: Additional metadata for trajectory-based evaluations
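A minimal construction that matches the exact-match row in the JSONL example below, assuming the Optional fields default to None:
result = EvaluateResult(
    score=1.0,
    is_score_valid=True,
    reason="Exact match",
    metrics={
        "exact_match": MetricResult(
            score=1.0,
            reason="assistant output matches ground truth",
        ),
    },
)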

EvaluationRow

The EvaluationRow is the canonical JSON-serializable unit of data used for both single-turn and trajectory evaluations. It contains the conversation, tool context, evaluation results, and metadata needed for reproducibility and analysis.
class EvaluationRow(BaseModel):
    # Core conversation (trajectory) data
    messages: List[Message]

    # Tool and function call information
    tools: Optional[List[Dict[str, Any]]] = None

    # Input-related metadata
    input_metadata: InputMetadata = Field(default_factory=InputMetadata)

    # Rollout status
    rollout_status: RolloutStatus = Field(default_factory=RolloutStatus)

    # Optional ground truth reference
    ground_truth: Optional[str] = None

    # Unified evaluation result
    evaluation_result: Optional[EvaluateResult] = None

    # Correlation identifiers grouped under execution metadata
    execution_metadata: ExecutionMetadata = Field(default_factory=ExecutionMetadata)

    # LLM usage statistics
    usage: Optional[CompletionUsage] = None

    # Timestamps and evaluation metadata
    created_at: datetime = Field(default_factory=datetime.now)
    eval_metadata: Optional[EvalMetadata] = None

    # Process info for watchdogs
    pid: Optional[int] = None
Key Features:
  • Unified Format: Canonical row format for both pointwise and trajectory evaluations
  • Explicit Status: rollout_status captures running/finished/error
  • Reproducibility: input_metadata, seeds, and identifiers support traceability
  • Usage Tracking: Captures token usage statistics from LLM calls
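A minimal row for a pointwise evaluation, assuming the models are importable from eval_protocol (adjust the import path to your installation):
from eval_protocol.models import EvaluationRow, Message  # import path assumed

row = EvaluationRow(
    messages=[
        Message(role="system", content="You are a helpful assistant."),
        Message(role="user", content="Add 2 and 3."),
        Message(role="assistant", content="5"),
    ],
    ground_truth="5",
)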

Dataset

A list of EvaluationRows. When saved to file, it is a JSONL file where each line is a JSON-encoded EvaluationRow.

JSONL example

{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Add 2 and 3."},{"role":"assistant","content":"5"}],"tools":null,"input_metadata":{"row_id":"row_123","completion_params":{"model":"gpt-4o","temperature":0.0,"max_tokens":256,"max_tool_calls":0},"dataset_info":{"seed":42,"system_prompt":"You are a helpful assistant.","environment_context":{}},"session_data":{"mode":"batch"}},"rollout_status":{"status":"finished","termination_reason":""},"ground_truth":"5","evaluation_result":{"score":1.0,"is_score_valid":true,"reason":"Exact match","metrics":{"exact_match":{"is_score_valid":true,"score":1.0,"reason":"assistant output matches ground truth"}},"step_outputs":null,"error":null,"trajectory_info":null,"final_control_plane_info":null},"execution_metadata":{"invocation_id":"ivk_abcd","experiment_id":"exp_efgh","rollout_id":"rll_ijkl","run_id":null},"usage":{"prompt_tokens":10,"completion_tokens":1,"total_tokens":11},"created_at":"2025-01-01T12:00:00","eval_metadata":{"name":"basic_addition","description":"Verify simple arithmetic","version":"0.1.0","status":"finished","num_runs":1,"aggregation_method":"mean","passed_threshold":{"success":0.95},"passed":true},"pid":12345}

EvaluationTest

EvaluationTest represents the configuration of an evaluation test. It is not defined as a separate class in the current implementation; instead, tests are configured through the @evaluation_test decorator, which can configure the following:
  • Dataset Configuration: JSONL files containing test cases or hard-coded input_messages
  • Model Configuration: Completion parameters (must include model) and generation settings via completion_params
  • Evaluation Criteria: Success thresholds (via passed_threshold), with optional standard deviation constraint
  • Environment Configuration: MCP config, rollout steps, server path, and concurrency
  • Rollout Processor: Function to execute rollouts (e.g., default_single_turn_rollout_processor)
  • Number of Runs: Number of times to repeat the rollout (e.g., num_runs=1)
  • Mode: Evaluation mode (pointwise or batch)
  • Aggregation: Aggregation method (e.g., mean) and optional env overrides for summaries
@evaluation_test(
    input_dataset=["tests/pytest/data/markdown_dataset.jsonl"],
    dataset_adapter=markdown_dataset_to_evaluation_row,
    completion_params=[{
        "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
        "temperature": 0.0,
        "max_tokens": 4096,
    }],
    passed_threshold={"success": 0.5},
    rollout_processor=default_single_turn_rollout_processor,
    num_runs=1,
    mode="pointwise",
)
def test_markdown_highlighting_evaluation(row: EvaluationRow) -> EvaluationRow:
    ...
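The body of the decorated function implements the rubric described under eval above: compute a score in [0, 1], attach it to evaluation_result, and return the row. An illustrative pointwise body (a generic exact-match rubric, not the actual markdown-highlighting rubric):
def score_row(row: EvaluationRow) -> EvaluationRow:
    # Assumes plain-string assistant content; real rubrics may need to handle content parts.
    answer = (row.messages[-1].content or "").strip()
    matched = row.ground_truth is not None and answer == row.ground_truth.strip()
    row.evaluation_result = EvaluateResult(
        score=1.0 if matched else 0.0,
        is_score_valid=True,
        reason="Exact match" if matched else "Does not match ground truth",
        metrics={},
    )
    return row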

MCP Gym

McpGym is the base class for building environments that an LLM can interact with via MCP tool calls (data plane) while exposing rewards and episode status via HTTP control-plane endpoints. This enables reproducible RL-style rollouts with clean separation of concerns. Key concepts:
  • Data plane: Tool calls and JSON responses used by the model to act and observe state
  • Control plane: Session-scoped endpoints for rewards, termination, and info
  • Multi-session: Stable session_id keys route control-plane queries to the right episode
Core API surface:
  • control_plane_endpoint(path): Decorator to register a session-aware endpoint
  • _register_tools(): Register domain tools with self.mcp.tool()
  • format_observation(obs, env) -> Dict[str, Any]: Return JSON-serializable observation payloads
  • run(transport="streamable-http"): Start the FastMCP server with high-concurrency settings
  • Standard control-plane endpoints on subclasses: /control/reward, /control/status, /control/info, /control/initial_state
Example stub:
class McpGym(ABC):
    def __init__(self, server_name: str, adapter: EnvironmentAdapter, seed: Optional[int] = None, max_workers: Optional[int] = None):
        ...

    @abstractmethod
    def _register_tools(self):
        ...

    def format_observation(self, obs: Any, env: Any) -> Dict[str, Any]:
        ...

    def run(self, transport: str = "streamable-http", **kwargs):
        ...
See python-sdk/eval_protocol/mcp/mcpgym.py for the full implementation including the control_plane_endpoint decorator and session handling.
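A hedged subclass sketch built only from the API surface listed above; the decorator usage, the adapter call, and the reward values are placeholders rather than real signatures, so check them against mcpgym.py:
from typing import Any, Dict

class ExampleGym(McpGym):
    """Illustrative subclass; tool and endpoint bodies are placeholders."""

    def _register_tools(self):
        # Data plane: the model acts by calling this MCP tool.
        @self.mcp.tool()
        def move(direction: str) -> Dict[str, Any]:
            obs = ...  # step the wrapped environment via the adapter (placeholder)
            return self.format_observation(obs, env=None)  # env argument is a placeholder

    # Control plane: session-scoped reward endpoint (decorator usage assumed; see mcpgym.py).
    @control_plane_endpoint("/control/reward")
    def control_reward(self, session_id: str) -> Dict[str, Any]:
        return {"reward": 0.0, "terminated": False}  # placeholder values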

Environment

The EnvironmentAdapter class provides the interface for connecting environments to the MCP framework.
class EnvironmentAdapter:
    """
    Environment adapter with default implementations.
    
    Users can either use this class directly by providing an env_class,
    or inherit from it to customize specific methods for their environment.
    This provides a clean separation between the MCP protocol layer
    and the environment implementation.
    """
Key Features:
  • Default Implementations: Works with most gymnasium-style and complex environments
  • Flexible Configuration: Supports custom configuration dictionaries
  • Seed Support: Reproducible environments through seed-based initialization
  • Clean Interface: Separates MCP protocol layer from environment implementation
Core Methods (see the subclass sketch after this list):
  • create_environment(): Create and return a new environment instance
  • create_environment_with_seed(): Create environment with specific seed for reproducibility
  • reset_environment(): Reset environment to initial state
  • step_environment(): Execute one step in the environment
  • close_environment(): Clean up environment resources
  • parse_action(): Parse action string to environment-specific format
  • format_observation(): Format observation for MCP transmission
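A hedged sketch of adapting a gymnasium-style environment; the method names follow the list above, but the exact signatures are assumptions and should be checked against the EnvironmentAdapter base class:
import gymnasium as gym
from typing import Any, Dict, Optional

class FrozenLakeAdapter(EnvironmentAdapter):
    """Illustrative adapter; signatures are assumed, not taken from the SDK."""

    def create_environment(self, config: Optional[Dict[str, Any]] = None):
        return gym.make("FrozenLake-v1", **(config or {}))

    def reset_environment(self, env, seed: Optional[int] = None):
        return env.reset(seed=seed)  # (observation, info)

    def step_environment(self, env, action):
        return env.step(action)  # (observation, reward, terminated, truncated, info)

    def parse_action(self, action_str: str) -> int:
        # Map a tool-call string such as "LEFT" onto the discrete action space.
        return {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}[action_str.upper()]

    def close_environment(self, env) -> None:
        env.close()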

Policy

A policy is a model such as gpt-4o or llama-3.1-8b. In more advanced scenarios, a policy can be your own custom fine-tuned model. The LiteLLMPolicy class provides a unified implementation that works with any MCP environment via tool calling:
class LiteLLMPolicy(LLMBasePolicy):
    """
    Unified LiteLLM policy implementation that works with ANY MCP environment via tool calling.
    
    Supports OpenAI, Anthropic, Fireworks AI
    Includes built-in retry logic and caching.
    NO environment-specific logic - everything comes from MCP tools and dataset prompts.
    """
Key Features:
  • Provider Agnostic: Supports OpenAI, Anthropic, Fireworks AI, and other providers
  • Built-in Caching: Multiple cache types (memory, Redis, dual, S3, disk)
  • Retry Logic: Robust retry strategies with exponential backoff
  • Tool Calling: Native support for MCP tool calling
  • Environment Agnostic: No environment-specific logic - everything from MCP tools
Specialized Implementations:
  • OpenAIPolicy: OpenAI-specific policy implementation
  • AnthropicPolicy: Anthropic Claude-specific policy implementation
  • FireworksPolicy: Fireworks AI-specific policy implementation
  • LocalPolicy: Local model policy implementation
Core Capabilities:
  • Multi-Tool Support: Handle multiple tool calls per turn
  • Conversation History: Maintain context across interactions
  • Error Handling: Graceful handling of API failures and retries
  • Caching: Response caching for improved performance and cost reduction
  • Logging: Comprehensive logging for debugging and analysis

Additional Core Classes

MCPSession

Represents a single MCP session with an environment:
@dataclass
class MCPSession:
    session_id: str
    base_url: str
    seed: Optional[int]
    model_id: str
    dataset_row: Optional[DatasetRow] = None
    terminated: bool = False
    last_observation: Any = None
    _exit_stack: Optional[AsyncExitStack] = None  # persistent connection resources
    _mcp_session: Optional[ClientSession] = None  # persistent MCP client session

Trajectory

Represents a complete rollout trajectory:
@dataclass
class Trajectory:
    session: MCPSession
    observations: List[Any]
    actions: List[str]
    rewards: List[float]
    terminated: bool
    total_reward: float
    steps: int
    duration: float
    control_plane_steps: List[Dict[str, Any]]
    control_plane_summary: Dict[str, Any]
    termination_reason: str
    conversation_history: List[Dict[str, Any]]
    usage: Dict[str, int] = field(default_factory=dict)