Klavis AI provides hosted Model Context Protocol (MCP) servers and managed sandbox environments that integrate with Eval Protocol. This guide covers two ways to use Klavis:
  1. Klavis MCP Sandbox - Fully managed isolated environments for model training and evaluation at scale
  2. Klavis MCP Server - Direct MCP server connections using your own accounts

Which Option Should You Choose?

Use Klavis MCP Sandbox

Use Klavis MCP Sandbox if you only have input data and ground truth for your RL work. Klavis Sandbox handles all the tooling infrastructure for you:
  • Hosted MCP Servers - hundreds of pre-built servers ready to use
  • Authentication - OAuth and session management handled automatically
  • Isolated Concurrency Environments - run 64+ models in parallel without interference
  • Tooling State Management - automatic initialization, reset, and cleanup
  • Scaling - dedicated QPS per instance with automatic account pooling
This is the turnkey solution for model training and reinforcement learning with tools.

Use Klavis MCP Server

Use Klavis MCP Server if you already have your own tooling infrastructure (authentication, isolated environments, state management, scaling) but only need Klavis hosted MCP servers to perform tool calls for your RL or model training work. This option allows you to connect directly to 100+ external applications through Klavis MCP while maintaining full control over your evaluation and training pipeline.

Use with Klavis MCP Sandbox

Klavis MCP Sandbox provides fully managed, isolated sandbox environments designed for training and evaluating models at scale. Each sandbox has dedicated accounts, automatic state initialization, and cleanup - allowing you to focus on model interaction without managing sandbox environments.

Key Features

  • Isolated Environments: Each sandbox gets dedicated, authenticated sessions with automatic token management
  • Account Pooling: Dynamic pool of test accounts supporting 64+ concurrent models
  • State Management: Built-in initialize, dump, and reset APIs for environment lifecycle
  • Supported Services: Gmail, Jira, Salesforce, Slack, Linear, Google Calendar, and 100+ more

Setup

Set up your API keys:
export KLAVIS_API_KEY=your_klavis_api_key
export FIREWORKS_API_KEY=your_fireworks_api_key
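Before running any tests, it can help to fail fast if a key is missing. A minimal sketch (the `require_env` helper is hypothetical, not part of Eval Protocol):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, failing fast if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```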

Step 1: Define Your Input Data and Ground Truth

Create a JSONL dataset file with your test cases. Each row should include:
  • initialize_data: Initial state to seed the sandbox
  • messages: The task instruction for your model
  • ground_truth: Expected final state after the model completes the task
Example dataset structure:
{
  "initialize_data": {
    "messages": [
      {
        "subject": "Project Update",
        "to": "[email protected]",
        "body": "The project is progressing well.",
        "from": "[email protected]",
        "labels": ["INBOX"]
      }
    ],
    "drafts": []
  },
  "messages": "Please delete the email with subject 'Spam Newsletter' from my inbox.",
  "ground_truth": {
    "messages": [
      {
        "subject": "Project Update",
        "to": "[email protected]",
        "body": "The project is progressing well.",
        "from": "[email protected]",
        "labels": ["INBOX"]
      }
    ],
    "drafts": []
  }
}
See the full example dataset for more test cases.
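If you generate test cases programmatically, the dataset rows above can be built and serialized with a small helper. A minimal sketch; the email addresses are hypothetical placeholders, and `to_jsonl` is not part of Eval Protocol:

```python
import json

def to_jsonl(rows):
    """Serialize test cases as JSONL: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)

# Initial sandbox state, matching the structure shown above.
initial_state = {
    "messages": [
        {
            "subject": "Project Update",
            "to": "[email protected]",
            "body": "The project is progressing well.",
            "from": "[email protected]",
            "labels": ["INBOX"],
        }
    ],
    "drafts": [],
}

row = {
    "initialize_data": initial_state,
    "messages": "Please delete the email with subject 'Spam Newsletter' from my inbox.",
    # The 'Spam Newsletter' email is not present in the initial state,
    # so the expected final state is identical to the initial state.
    "ground_truth": initial_state,
}

# Write the dataset file consumed by the evaluation test, e.g.:
# with open("datasets/klavis_gmail_sandbox_test.jsonl", "w") as f:
#     f.write(to_jsonl([row]))
```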

Step 2: Implement Your RolloutProcessor

Use the KlavisSandboxRolloutProcessor to handle sandbox lifecycle management. The processor will:
  1. Create an isolated sandbox instance
  2. Initialize the sandbox with your input data
  3. Run your model with MCP tools from the sandbox
  4. Dump the final state after model interaction
  5. Clean up and return the sandbox to the pool
from eval_protocol.pytest import KlavisSandboxRolloutProcessor

rollout_processor = KlavisSandboxRolloutProcessor(
    server_name="gmail",  # or "jira", "salesforce", "slack", etc.
)
For custom initialization logic, you can extend KlavisSandboxRolloutProcessor:
from typing import Dict, Any
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest import KlavisSandboxRolloutProcessor

def custom_initialize_data(row: EvaluationRow) -> Dict[str, Any]:
    # Custom logic to transform your row data
    # into sandbox initialization format
    return {
        "messages": row.input_metadata.session_data.get("emails", []),
        "drafts": []
    }

rollout_processor = KlavisSandboxRolloutProcessor(
    server_name="gmail",
    initialize_data_factory=custom_initialize_data
)
See the full implementation for advanced customization.

Step 3: Evaluate by Comparing State with Ground Truth

Create your evaluation test that compares the final sandbox state with your ground truth:
import json
import os

from eval_protocol.pytest import evaluation_test, KlavisSandboxRolloutProcessor
from eval_protocol.models import EvaluationRow, EvaluateResult
from openai import AsyncOpenAI

@evaluation_test(
    input_dataset=["datasets/klavis_gmail_sandbox_test.jsonl"],
    completion_params=[{"model": "fireworks_ai/accounts/fireworks/models/llama-v3p3-70b-instruct"}],
    rollout_processor=KlavisSandboxRolloutProcessor(server_name="gmail"),
    mode="pointwise",
)
async def test_gmail_sandbox(row: EvaluationRow) -> EvaluationRow:
    # Extract final sandbox state and ground truth
    sandbox_data = row.execution_metadata.extra.get("sandbox_data", {})
    ground_truth = row.ground_truth
    
    # Use LLM judge to evaluate
    async with AsyncOpenAI(
        api_key=os.environ["FIREWORKS_API_KEY"],
        base_url="https://api.fireworks.ai/inference/v1",
    ) as client:
        response = await client.chat.completions.create(
            model="accounts/fireworks/models/kimi-k2-thinking",
            messages=[{"role": "user", "content": f"Compare final state {sandbox_data} with expected {ground_truth}. Return score 0-1."}],
            response_format={"type": "json_schema", ...},
        )
        score = json.loads(response.choices[0].message.content).get("score", 0.0)
        row.evaluation_result = EvaluateResult(score=score)
    
    return row
The final sandbox state is available in row.execution_metadata.extra["sandbox_data"]. Use an LLM judge to semantically compare it with your ground truth. See the complete test implementation for the full example.
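As a complement (or fallback) to the LLM judge, a strict comparison works when the dumped state is fully structured. A sketch, assuming the sandbox dump and the ground truth share the same schema; `state_matches` is a hypothetical helper:

```python
import json

def state_matches(sandbox_data: dict, ground_truth: dict) -> bool:
    """Order-insensitive equality check between two sandbox state dumps."""
    def normalize(items):
        # Canonicalize each message/draft so list order does not matter.
        return sorted(json.dumps(item, sort_keys=True) for item in items)

    for key, expected in ground_truth.items():
        actual = sandbox_data.get(key, [])
        if normalize(expected) != normalize(actual):
            return False
    return True
```

A deterministic check like this catches hard failures cheaply; the LLM judge remains useful when field values can differ in wording but match semantically.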

Use with Klavis MCP Server

Setting Up Klavis MCP Server

Log in to your Klavis AI account, find the applications you want to connect with Eval Protocol, and enable MCP for those applications. Follow the auth flow to authorize Klavis MCP to access them on your behalf; the Klavis quickstart guide here walks through this setup. In the Klavis dashboard, click Add to Other Clients and generate an access token. Save the token in your .env file as KLAVIS_API_KEY. The Klavis MCP server is defined as follows in the Eval Protocol configuration:
{
  "mcpServers": {
    "klavis-strata": {
      "url": "https://strata.klavis.ai/mcp/",
      "authorization": "Bearer ${KLAVIS_API_KEY}"
    }
  }
}
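The `${KLAVIS_API_KEY}` placeholder is resolved from your environment. If you ever need to load such a config yourself, a minimal sketch of the same substitution (the `load_mcp_config` helper is hypothetical):

```python
import json
import os
from string import Template

def load_mcp_config(raw: str) -> dict:
    """Expand ${VAR} placeholders from the environment, then parse the JSON."""
    return json.loads(Template(raw).substitute(os.environ))
```

Note that `Template.substitute` raises `KeyError` when a referenced variable is unset, which makes misconfiguration fail fast.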

Using Klavis MCP Server in Eval Protocol

We’ve set up an example in Eval Protocol that uses Klavis MCP Server; you can extend it to connect to more applications and add more use cases. Here is the example test file. In this example, we connect to Gmail, Notion, and Outlook Calendar through Klavis MCP and include a few example test cases. To run this workflow, you need to set up the test cases in those applications.

Gmail

No particular setup is required, but you should have at least 5 emails in your Gmail inbox.

Notion

Copy this Notion page template (credit to MCPMark) to your Notion workspace. When you authorize Klavis MCP to access Notion, make sure to grant access to this page.

Outlook Calendar

Set up the following calendar events in your Outlook calendar. It is recommended to create a new Outlook account with a clean calendar for testing.
  1. Create 3 events today. Ideally, one should start at 12 am today and one should end at 12 am tomorrow.
  2. Create an event that covers your next working day's working hours, except the first and last hour. Outlook Calendar's default working hours are Monday to Friday, 8 am to 5 pm, so in that case you would create an event from 9 am to 4 pm on your next working day.
  3. Create a total of 8 events on this week's working days. This total should include the events above if they fall on working days.
  4. Following step 1, create 2 events on next week's Thursday.
  5. Following step 3, create a total of 5 events on next week's working days.
  6. Following step 1, create 4 events on Oct 15, 2025.
  7. Following step 3, create a total of 9 events from Oct 13 to Oct 17, 2025.

Resources