The Braintrust adapter lets you pull data from Braintrust deployments and convert it to EvaluationRow format for use in evaluation pipelines. This makes it possible to evaluate production conversations and tool-calling traces directly from your Braintrust deployment using BTQL queries.

Installation

Install Eval Protocol with Braintrust support:
pip install 'eval-protocol[braintrust]'

Basic Usage

import pytest
from eval_protocol import (
    evaluation_test,
    aha_judge,
    multi_turn_assistant_to_ground_truth,
    EvaluationRow,
    SingleTurnRolloutProcessor,
    DynamicDataLoader,
    create_braintrust_adapter,
)


def braintrust_data_generator() -> list[EvaluationRow]:
    """Execute a BTQL query and convert results to EvaluationRow."""
    adapter = create_braintrust_adapter()
    btql_query = """
    select: *
    from: project_logs('your_project_id') traces
    limit: 50
    """
    return adapter.get_evaluation_rows(btql_query)


@pytest.mark.parametrize(
    "completion_params",
    [
        {"model": "gpt-4.1"},
        {
            "max_tokens": 131000,
            "extra_body": {"reasoning_effort": "medium"},
            "model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b",
        },
        {
            "max_tokens": 131000,
            "extra_body": {"reasoning_effort": "low"},
            "model": "fireworks_ai/accounts/fireworks/models/gpt-oss-20b",
        },
    ],
)
@evaluation_test(
    data_loaders=DynamicDataLoader(
        generators=[braintrust_data_generator],
    ),
    rollout_processor=SingleTurnRolloutProcessor(),
    preprocess_fn=multi_turn_assistant_to_ground_truth,
    max_concurrent_evaluations=2,
)
async def test_braintrust_data(row: EvaluationRow) -> EvaluationRow:
    return await aha_judge(row)

Configuration

Set up your Braintrust credentials using environment variables:
export BRAINTRUST_API_KEY="your_api_key"
export BRAINTRUST_PROJECT_ID="your_project_id"
export BRAINTRUST_API_URL="https://api.braintrust.dev"  # Optional; this is the default
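The adapter is expected to read these variables from the environment, so create_braintrust_adapter() is typically called with no arguments. As a minimal sketch (assuming the variable names above), you can fail fast before constructing the adapter if credentials are missing:
import os

from eval_protocol.adapters.braintrust import create_braintrust_adapter

# Fail fast if the required credentials are missing; BRAINTRUST_API_URL is optional.
missing = [
    name
    for name in ("BRAINTRUST_API_KEY", "BRAINTRUST_PROJECT_ID")
    if not os.environ.get(name)
]
if missing:
    raise RuntimeError(f"Missing Braintrust environment variables: {', '.join(missing)}")

adapter = create_braintrust_adapter()  # picks up the credentials from the environment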

API Reference

BraintrustAdapter

The main adapter class for pulling data from Braintrust using BTQL queries.

get_evaluation_rows()

Execute a BTQL query and convert the results to EvaluationRow format. See Braintrust's BTQL documentation for details on query syntax.
def get_evaluation_rows(
    self,
    btql_query: str,
    include_tool_calls: bool = True,
    converter: Optional[TraceConverter] = None,
) -> List[EvaluationRow]
Parameters:
  • btql_query - The BTQL query string to execute
  • include_tool_calls - Whether to include tool calling information
  • converter - Optional custom converter implementing TraceConverter protocol

upload_scores()

Upload evaluation scores back to Braintrust traces.
def upload_scores(
    self,
    rows: List[EvaluationRow], 
    model_name: str, 
    mean_score: float
) -> None
Parameters:
  • rows - List of EvaluationRow objects with session_data containing trace IDs
  • model_name - Name of the model (used as the score name in Braintrust)
  • mean_score - The calculated mean score to push to Braintrust
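A hedged usage sketch: once your evaluation has produced a set of rows and an aggregate score, push the mean back to the traces the rows came from. The variable name evaluated_rows and the score value below are illustrative; how you compute the mean depends on your evaluation.
# Assumes `evaluated_rows` came from get_evaluation_rows() and still carry their
# braintrust_trace_id in session_data; the score value is a placeholder.
adapter.upload_scores(
    rows=evaluated_rows,
    model_name="gpt-4.1",  # shown as the score name in Braintrust
    mean_score=0.82,
)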

Factory Function

For convenience, you can use the factory function:
from eval_protocol.adapters.braintrust import create_braintrust_adapter

adapter = create_braintrust_adapter()

Source Code

The complete implementation is available on GitHub: eval_protocol/adapters/braintrust.py

Tool Calling Support

The adapter automatically handles tool-calling traces from Braintrust:
# Include tool calls (default behavior)
btql_query = "select: * from: project_logs('your_project_id') traces limit: 10"
rows = adapter.get_evaluation_rows(btql_query, include_tool_calls=True)

# Exclude tool calls for simpler evaluation
rows = adapter.get_evaluation_rows(btql_query, include_tool_calls=False)
Tool calls are extracted from trace metadata and preserved in the Message format.
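If you need to inspect the preserved tool calls, a sketch like the following works under the assumption that Message follows the OpenAI chat format and exposes an optional tool_calls field (getattr keeps the loop safe if a message has none):
rows = adapter.get_evaluation_rows(btql_query, include_tool_calls=True)
for row in rows:
    for message in row.messages:
        # Each entry is expected to mirror the OpenAI tool-call structure.
        for call in getattr(message, "tool_calls", None) or []:
            print(message.role, call)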

Data Conversion

The adapter converts Braintrust traces to EvaluationRow format:

Supported Trace Formats

{
    "input": {
        "messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"}
        ]
    },
    "output": [
        {
            "message": {
                "role": "assistant", 
                "content": "Generated response"
            }
        }
    ]
}
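Roughly, a trace in the shape above becomes an EvaluationRow whose messages combine the input messages with the output message. The sketch below is illustrative (the exact merging of assistant turns may differ) and assumes InputMetadata is importable from eval_protocol.models:
from eval_protocol.models import EvaluationRow, InputMetadata, Message

# Approximate result of converting the trace shown above.
row = EvaluationRow(
    messages=[
        Message(role="user", content="Hello"),
        Message(role="assistant", content="Hi there!"),
        Message(role="assistant", content="Generated response"),
    ],
    input_metadata=InputMetadata(
        session_data={"braintrust_trace_id": "<original trace id>"},
    ),
)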

Metadata Preservation

The adapter stores the original Braintrust trace ID in the evaluation row metadata:
for row in rows:
    trace_id = row.input_metadata.session_data.get("braintrust_trace_id")
    print(f"Processing trace: {trace_id}")

Advanced Features

Custom Trace Conversion

Implement custom conversion logic using the TraceConverter protocol:
from typing import Any, Dict, Optional

from eval_protocol.adapters.braintrust import TraceConverter
from eval_protocol.models import EvaluationRow, InputMetadata, Message

class CustomConverter:
    def __call__(self, trace: Dict[str, Any], include_tool_calls: bool) -> Optional[EvaluationRow]:
        # Your custom conversion logic here
        if not self.should_process_trace(trace):
            return None
            
        messages = self.extract_custom_messages(trace)
        return EvaluationRow(
            messages=messages,
            input_metadata=InputMetadata(
                session_data={"braintrust_trace_id": trace.get("id")}
            )
        )
    
    def should_process_trace(self, trace):
        # Custom filtering logic
        return trace.get("metadata", {}).get("agent_type") == "customer_service"
    
    def extract_custom_messages(self, trace):
        # Custom message extraction
        return [Message(role="user", content="Custom extraction logic")]

# Use custom converter
converter = CustomConverter()
btql_query = "select: * from: project_logs('your_project_id') traces limit: 100"
rows = adapter.get_evaluation_rows(btql_query, converter=converter)

Complete Example

For a production-ready example that integrates Braintrust with Arena-Hard-Auto evaluation, see llm_judge_braintrust.py. It demonstrates BTQL query usage, multi-model comparison, and automated Arena-Hard-Auto judging to create model leaderboards without writing custom evaluation logic.
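As a smaller end-to-end sketch composing the pieces documented above (query traces, evaluate them, push the mean score back), something like the following works; the score calculation is a placeholder for whatever your evaluation produces:
from eval_protocol import create_braintrust_adapter

adapter = create_braintrust_adapter()

btql_query = "select: * from: project_logs('your_project_id') traces limit: 25"
rows = adapter.get_evaluation_rows(btql_query)

# Run your evaluation here (for example via @evaluation_test) and aggregate a mean score.
mean_score = 0.75  # placeholder value

adapter.upload_scores(rows=rows, model_name="gpt-4.1", mean_score=mean_score)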