Klavis AI provides hosted Model Context Protocol (MCP) servers and managed sandbox environments that integrate with Eval Protocol. This guide covers two ways to use Klavis:
  1. Klavis MCP Sandbox - Fully managed isolated environments for model training and evaluation at scale
  2. Klavis MCP Server - Direct MCP server connections using your own accounts

Which Option Should You Choose?

Use Klavis MCP Sandbox

Use Klavis MCP Sandbox if you only have input data and ground truth for your RL work. Klavis Sandbox handles all the tooling infrastructure for you:
  • Hosted MCP Servers - hundreds of pre-built servers ready to use
  • Authentication - OAuth and session management handled automatically
  • Isolated Concurrency Environments - run 64+ models in parallel without interference
  • Tooling State Management - automatic initialization, reset, and cleanup
  • Scaling - dedicated QPS per instance with automatic account pooling
This is the turnkey solution for model training and reinforcement learning with tools.

Use Klavis MCP Server

Use Klavis MCP Server if you already have your own tooling infrastructure (authentication, isolated environments, state management, scaling) but only need Klavis hosted MCP servers to perform tool calls for your RL or model training work. This option allows you to connect directly to 100+ external applications through Klavis MCP while maintaining full control over your evaluation and training pipeline.

Use with Klavis MCP Sandbox

Klavis MCP Sandbox provides fully managed, isolated sandbox environments designed for training and evaluating models at scale. Each sandbox has dedicated accounts, automatic state initialization, and cleanup - allowing you to focus on model interaction without managing sandbox environments.

Key Features

  • Isolated Environments: Each sandbox gets dedicated, authenticated sessions with automatic token management
  • Account Pooling: Dynamic pool of test accounts supporting 64+ concurrent models
  • State Management: Built-in initialize, dump, and reset APIs for environment lifecycle
  • Supported Services: Gmail, Jira, Salesforce, Slack, Linear, Google Calendar, and 100+ more

Setup

Set up your API keys:
export KLAVIS_API_KEY=your_klavis_api_key
export FIREWORKS_API_KEY=your_fireworks_api_key
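Before running any tests, it can help to fail fast if a key is missing. A minimal sketch (the `require_env` helper is hypothetical, not part of Eval Protocol):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, failing fast if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```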

Step 1: Define Your Input Data and Ground Truth

Create a JSONL dataset file with your test cases. Each row should include:
  • initialize_data: Initial state to seed the sandbox
  • messages: The task instruction for your model
  • ground_truth: Expected final state after the model completes the task
Example dataset structure:
{
  "initialize_data": {
    "messages": [
      {
        "subject": "Project Update",
        "to": "[email protected]",
        "body": "The project is progressing well.",
        "from": "[email protected]",
        "labels": ["INBOX"]
      }
    ],
    "drafts": []
  },
  "messages": "Please delete the email with subject 'Spam Newsletter' from my inbox.",
  "ground_truth": {
    "messages": [
      {
        "subject": "Project Update",
        "to": "[email protected]",
        "body": "The project is progressing well.",
        "from": "[email protected]",
        "labels": ["INBOX"]
      }
    ],
    "drafts": []
  }
}
See the full example dataset for more test cases.
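If you generate test cases programmatically, the dataset rows above can be built and serialized with a small helper. A minimal sketch; the email addresses are hypothetical placeholders, and `to_jsonl` is not part of Eval Protocol:

```python
import json

def to_jsonl(rows):
    """Serialize test cases as JSONL: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)

# Initial sandbox state, matching the structure shown above.
initial_state = {
    "messages": [
        {
            "subject": "Project Update",
            "to": "[email protected]",
            "body": "The project is progressing well.",
            "from": "[email protected]",
            "labels": ["INBOX"],
        }
    ],
    "drafts": [],
}

row = {
    "initialize_data": initial_state,
    "messages": "Please delete the email with subject 'Spam Newsletter' from my inbox.",
    # The 'Spam Newsletter' email is not present in the initial state,
    # so the expected final state is identical to the initial state.
    "ground_truth": initial_state,
}

# Write the dataset file consumed by the evaluation test, e.g.:
# with open("datasets/klavis_gmail_sandbox_test.jsonl", "w") as f:
#     f.write(to_jsonl([row]))
```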

Step 2: Implement Your RolloutProcessor

Use the KlavisSandboxRolloutProcessor to handle sandbox lifecycle management. The processor will:
  1. Create an isolated sandbox instance
  2. Initialize the sandbox with your input data
  3. Run your model with MCP tools from the sandbox
  4. Dump the final state after model interaction
  5. Clean up and return the sandbox to the pool
from eval_protocol.pytest import KlavisSandboxRolloutProcessor

rollout_processor = KlavisSandboxRolloutProcessor(
    server_name="gmail",  # or "jira", "salesforce", "slack", etc.
)
For custom initialization logic, you can extend KlavisSandboxRolloutProcessor:
from typing import Dict, Any
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest import KlavisSandboxRolloutProcessor

def custom_initialize_data(row: EvaluationRow) -> Dict[str, Any]:
    # Custom logic to transform your row data
    # into sandbox initialization format
    return {
        "messages": row.input_metadata.session_data.get("emails", []),
        "drafts": []
    }

rollout_processor = KlavisSandboxRolloutProcessor(
    server_name="gmail",
    initialize_data_factory=custom_initialize_data
)
See the full implementation for advanced customization.

Step 3: Evaluate by Comparing State with Ground Truth

Create your evaluation test that compares the final sandbox state with your ground truth:
import json
import os

from eval_protocol.pytest import evaluation_test, KlavisSandboxRolloutProcessor
from eval_protocol.models import EvaluationRow, EvaluateResult
from openai import AsyncOpenAI

@evaluation_test(
    input_dataset=["datasets/klavis_gmail_sandbox_test.jsonl"],
    completion_params=[{"model": "fireworks_ai/accounts/fireworks/models/llama-v3p3-70b-instruct"}],
    rollout_processor=KlavisSandboxRolloutProcessor(server_name="gmail"),
    mode="pointwise",
)
async def test_gmail_sandbox(row: EvaluationRow) -> EvaluationRow:
    # Extract final sandbox state and ground truth
    sandbox_data = row.execution_metadata.extra.get("sandbox_data", {})
    ground_truth = row.ground_truth
    
    # Use LLM judge to evaluate
    async with AsyncOpenAI(
        api_key=os.environ["FIREWORKS_API_KEY"],
        base_url="https://api.fireworks.ai/inference/v1",
    ) as client:
        response = await client.chat.completions.create(
            model="accounts/fireworks/models/kimi-k2-thinking",
            messages=[{"role": "user", "content": f"Compare final state {sandbox_data} with expected {ground_truth}. Return score 0-1."}],
            response_format={"type": "json_schema", ...},
        )
        score = json.loads(response.choices[0].message.content).get("score", 0.0)
        row.evaluation_result = EvaluateResult(score=score)
    
    return row
The final sandbox state is available in row.execution_metadata.extra["sandbox_data"]. Use an LLM judge to semantically compare it with your ground truth. See the complete test implementation for the full example.
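As a complement (or fallback) to the LLM judge, a strict comparison works when the dumped state is fully structured. A sketch, assuming the sandbox dump and the ground truth share the same schema; `state_matches` is a hypothetical helper:

```python
import json

def state_matches(sandbox_data: dict, ground_truth: dict) -> bool:
    """Order-insensitive equality check between two sandbox state dumps."""
    def normalize(items):
        # Canonicalize each message/draft so list order does not matter.
        return sorted(json.dumps(item, sort_keys=True) for item in items)

    for key, expected in ground_truth.items():
        actual = sandbox_data.get(key, [])
        if normalize(expected) != normalize(actual):
            return False
    return True
```

A deterministic check like this catches hard failures cheaply; the LLM judge remains useful when field values can differ in wording but match semantically.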

Use with Klavis MCP Server

Setting Up Klavis MCP Server

Log in to your Klavis AI account, find the applications you want to connect with Eval Protocol, and enable MCP for those applications. Follow the auth flow to authorize Klavis MCP to access them on your behalf; the Klavis quickstart guide here walks through this setup. In the Klavis dashboard, click Add to Other Clients and generate an access token. Save the token in your .env file as KLAVIS_API_KEY. The Klavis MCP server is defined as follows in the Eval Protocol configuration:
{
  "mcpServers": {
    "klavis-strata": {
      "url": "https://strata.klavis.ai/mcp/",
      "authorization": "Bearer ${KLAVIS_API_KEY}"
    }
  }
}
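The `${KLAVIS_API_KEY}` placeholder is resolved from your environment. If you ever need to load such a config yourself, a minimal sketch of the same substitution (the `load_mcp_config` helper is hypothetical):

```python
import json
import os
from string import Template

def load_mcp_config(raw: str) -> dict:
    """Expand ${VAR} placeholders from the environment, then parse the JSON."""
    return json.loads(Template(raw).substitute(os.environ))
```

Note that `Template.substitute` raises `KeyError` when a referenced variable is unset, which makes misconfiguration fail fast.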

Using Klavis MCP Server in Eval Protocol

We’ve set up an example in Eval Protocol that uses Klavis MCP Server; you can extend it to connect to more applications and add more use cases. Here is the example test file. In this example, we connect to Gmail, Notion, and Outlook Calendar through Klavis MCP and include a few example test cases. To run this workflow, you need to set up the test cases in those applications.

Gmail

No particular setup is required, but you should have at least 5 emails in your Gmail inbox.

Notion

Copy this Notion page template (credit to MCPMark) to your Notion workspace. When you authorize Klavis MCP to access Notion, make sure to grant access to this page.

Outlook Calendar

Set up the following calendar events in your Outlook calendar. It is recommended to create a new Outlook account with a clean calendar for testing.
  1. Create 3 events today. Ideally, one should start at 12 am today and one should end at 12 am tomorrow.
  2. Create an event that covers your next working day's working hours, except the first and last hour. Outlook Calendar's default working hours are Monday to Friday, 8 am to 5 pm, so in that case you would create an event from 9 am to 4 pm on your next working day.
  3. Create a total of 8 events on this week's working days. This total should include the events above if they fall on working days.
  4. Following step 1, create 2 events on next week's Thursday.
  5. Following step 3, create a total of 5 events on next week's working days.
  6. Following step 1, create 4 events on Oct 15, 2025.
  7. Following step 3, create a total of 9 events from Oct 13 to Oct 17, 2025.

Resources