Let’s walk through creating an evaluation that checks if model responses contain the required number of highlighted sections (like bold or italic text). This example demonstrates the core concepts of writing evaluations with Eval Protocol (EP).
You can find the complete code for this example at test_markdown_highlighting.py.

Understanding the Dataset

Before we start coding, let’s understand what we’re working with. The markdown_dataset.jsonl file contains diverse test cases that evaluate a model’s ability to follow markdown formatting instructions.

Dataset Format

Each entry in the dataset contains:
  • key: Unique identifier for the test case
  • prompt: The instruction given to the model, which includes:
    • Clear task description
    • Specific markdown formatting requirements
    • Examples of the expected format
    • Minimum number of highlights required
  • num_highlights: The ground truth value (number of highlighted sections required)
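To get a feel for the format before writing any evaluation code, you can load the file and print these fields for a few entries. This is just a convenience sketch; it assumes the file lives at tests/pytest/data/markdown_dataset.jsonl, the path used later in the @evaluation_test decorator.
import json

# Assumed path; matches the input_dataset used later in the decorator
DATASET_PATH = "tests/pytest/data/markdown_dataset.jsonl"

with open(DATASET_PATH) as f:
    rows = [json.loads(line) for line in f if line.strip()]

# Print the fields we care about for the first few test cases
for row in rows[:3]:
    print(row["key"], row["num_highlights"], row["prompt"][:60], sep=" | ")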

Example Dataset Entries

Creative Writing Tasks:
{
  "key": 1773,
  "prompt": "Write a song about the summers of my childhood that I spent in the countryside. Give the song a name, and highlight the name by wrapping it with *. For example: *little me in the countryside*.",
  "num_highlights": 1
}
Business and Professional Content:
{
  "key": 167,
  "prompt": "Generate a business proposal to start a sweatshirt company in Bremen. The proposal should contain 5 or more sections. Highlight each section name using the this format:\n*section name*",
  "num_highlights": 5
}
Educational and Informational Content:
{
  "key": 3453,
  "prompt": "Summarize the history of Japan. Italicize at least 5 keywords in your response. To indicate a italic word, wrap it with asterisk, like *italic*",
  "num_highlights": 5
}

Dataset Characteristics

Diversity: The dataset covers various content types:
  • Creative writing (songs, poems, raps)
  • Business documents (proposals, cover letters)
  • Educational content (summaries, blog posts)
  • Entertainment (riddles, jokes)
Formatting Instructions: Each prompt clearly specifies:
  • The markdown syntax to use (*text* for italic, **text** for bold)
  • Minimum number of highlights required
  • Examples of proper formatting
  • Context for when highlighting should be applied
Realistic Scenarios: The prompts simulate real-world use cases where markdown formatting is important for readability and emphasis.
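If you want to verify these characteristics yourself, a quick pass over the rows loaded in the earlier snippet can confirm that every entry carries a positive num_highlights and that each prompt mentions the asterisk syntax. The assertions below are illustrative only and assume the rows variable from the loading sketch above:
# Illustrative sanity check; assumes `rows` from the loading sketch above
for row in rows:
    assert row["num_highlights"] >= 1, f"entry {row['key']} requires no highlights"
    assert "*" in row["prompt"], f"entry {row['key']} never mentions asterisks"

print(f"Checked {len(rows)} entries")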

Step 1: Import Required Dependencies

Now let’s start coding. First, we import the necessary modules from the EP framework:
import re
from typing import Any, Dict, List, Optional

from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
  • re: Python’s regex module for pattern matching
  • EvaluateResult: The result object that contains the evaluation score and reasoning
  • EvaluationRow: Represents a single evaluation test case with messages and ground truth
  • Message: Represents a message in the conversation
  • evaluation_test: Decorator that configures the evaluation test
  • SingleTurnRolloutProcessor: Rollout processor that handles the conversation flow for single-turn evaluations

Step 2: Create the Dataset Adapter

We need to create an adapter that converts our dataset format to the EP’s expected format:
def markdown_dataset_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """
    Convert entries from markdown dataset to EvaluationRow objects.
    """
    return [
        EvaluationRow(
            messages=[Message(role="user", content=row["prompt"])], 
            ground_truth=str(row["num_highlights"])
        )
        for row in data
    ]
This adapter:
  • Takes the raw dataset as a list of dictionaries
  • Converts each row to an EvaluationRow with a user message
  • Sets the ground truth to the required number of highlights
  • Returns the list of evaluation rows
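To see the adapter in action, you can call it on a couple of raw dictionaries and inspect the result. The sample entries below are abbreviated versions of the dataset entries shown earlier, and the snippet assumes the imports and adapter defined in Steps 1 and 2:
# Abbreviated sample entries, for illustration only
sample_data = [
    {"key": 1773, "prompt": "Write a song about the summers of my childhood ...", "num_highlights": 1},
    {"key": 167, "prompt": "Generate a business proposal ...", "num_highlights": 5},
]

eval_rows = markdown_dataset_to_evaluation_row(sample_data)

for row in eval_rows:
    print(row.messages[0].role, "->", row.ground_truth)
# user -> 1
# user -> 5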

Step 3: Define the Evaluation Function

The evaluation function contains the core logic that analyzes model responses. EP provides the @evaluation_test decorator to configure the evaluation with the following parameters:
@evaluation_test(
    input_dataset=["tests/pytest/data/markdown_dataset.jsonl"],
    dataset_adapter=markdown_dataset_to_evaluation_row,
    completion_params=[{"model": "fireworks_ai/accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0, "max_tokens": 4096}],
    passed_threshold=0.5,
    rollout_processor=SingleTurnRolloutProcessor(),
    num_runs=1,
    mode="pointwise",
)
def test_markdown_highlighting_evaluation(row: EvaluationRow) -> EvaluationRow:
    """
    Evaluation function that checks if the model's response contains the required number of formatted sections.
    """
    # Extract the assistant's response from the conversation
    assistant_response = row.messages[-1].content
    
    # Handle empty responses by attaching a zero score and returning the row early
    if not assistant_response:
        row.evaluation_result = EvaluateResult(score=0.0, reason="❌ No assistant response found")
        return row
    
    # Convert ground truth to required number of highlights
    required_highlights = int(row.ground_truth)
Key points:
  • The function receives an EvaluationRow parameter and returns it with the evaluation result attached
  • We extract the last message (assistant’s response) using row.messages[-1].content
  • We handle edge cases like empty responses
  • The row.ground_truth contains the required number of highlighted sections
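By the time this function runs, the rollout processor has already appended the assistant’s reply to messages. Purely for illustration (the assistant text below is invented), the row it receives looks roughly like this:
# Hand-built example of the shape the function receives (illustrative only)
example_row = EvaluationRow(
    messages=[
        Message(role="user", content="Write a song ... highlight the name by wrapping it with *."),
        Message(role="assistant", content="*Summer Fields*\n\nVerse 1: Dusty roads and golden light..."),
    ],
    ground_truth="1",
)

assert example_row.messages[-1].content.startswith("*Summer Fields*")
assert int(example_row.ground_truth) == 1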

Step 4: Implement the Analysis Logic

Next, we implement the core logic to count highlighted sections:
    # Count highlighted sections (**bold** or *italic*)
    actual_count = 0
    
    # Find italic text patterns (*text*)
    highlights = re.findall(r"\*[^\n\*]*\*", assistant_response)
    
    # Find bold text patterns (**text**)
    double_highlights = re.findall(r"\*\*[^\n\*]*\*\*", assistant_response)
    
    # Count valid italic highlights (non-empty content)
    for highlight in highlights:
        if highlight.strip("*").strip():
            actual_count += 1
    
    # Count valid bold highlights (non-empty content)
    for highlight in double_highlights:
        if highlight.removeprefix("**").removesuffix("**").strip():
            actual_count += 1
Regex patterns explained:
  • r"\*[^\n\*]*\*": Matches italic text between single asterisks
    • \*: Literal asterisk
    • [^\n\*]*: Any characters except newlines and asterisks
    • \*: Closing asterisk
  • r"\*\*[^\n\*]*\*\*": Matches bold text between double asterisks
  • We filter out empty highlights to ensure quality
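To convince yourself the counting behaves as expected, you can run the same two patterns over a small sample string. Note that the single-asterisk pattern also matches the empty ** pairs left over from bold markers, which is exactly what the empty-content filter removes:
import re

sample = "Here is *an italic phrase*, a **bold heading**, and a stray * asterisk."

highlights = re.findall(r"\*[^\n\*]*\*", sample)
double_highlights = re.findall(r"\*\*[^\n\*]*\*\*", sample)

count = 0
for h in highlights:
    if h.strip("*").strip():  # drops the empty "**" fragments from bold markers
        count += 1
for h in double_highlights:
    if h.removeprefix("**").removesuffix("**").strip():
        count += 1

print(highlights)         # ['*an italic phrase*', '**', '**']
print(double_highlights)  # ['**bold heading**']
print(count)              # 2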

Step 5: Generate the Evaluation Result

Finally, we compare the actual count against requirements and attach the result to the row:
    # Determine if the response meets the requirement
    meets_requirement = actual_count >= required_highlights
    
    if meets_requirement:
        row.evaluation_result = EvaluateResult(
            score=1.0,
            reason=f"✅ Found {actual_count} highlighted sections (required: {required_highlights})"
        )
    else:
        row.evaluation_result = EvaluateResult(
            score=0.0,
            reason=f"❌ Only found {actual_count} highlighted sections (required: {required_highlights})"
        )
    return row
Result structure:
  • score: 1.0 for success, 0.0 for failure
  • reason: Human-readable explanation with emojis for clarity
  • The result is attached to row.evaluation_result and the row is returned

Step 6: Configuration Parameters

The @evaluation_test decorator configures the evaluation with these parameters:
  • input_dataset: Path to the JSONL file containing test cases
  • dataset_adapter: Function that converts the raw dataset to EvaluationRow objects
  • completion_params: Model and generation settings (the Fireworks Kimi model here, plus temperature and max tokens)
  • passed_threshold: Minimum score required to pass (0.5 = 50% success rate)
  • rollout_processor: Processor that handles the conversation flow (SingleTurnRolloutProcessor for single-turn evaluations)
  • num_runs: Number of times to run each test case
  • mode: Evaluation mode (“pointwise” for individual test case evaluation)
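The same decorator can be reconfigured without touching the evaluation logic. As a purely illustrative variant (reusing the imports, adapter, and parameter names shown above), the sketch below repeats each test case three times at a higher sampling temperature:
# Illustrative variant configuration; the evaluation body itself is unchanged
@evaluation_test(
    input_dataset=["tests/pytest/data/markdown_dataset.jsonl"],
    dataset_adapter=markdown_dataset_to_evaluation_row,
    completion_params=[{"model": "fireworks_ai/accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.7, "max_tokens": 4096}],
    passed_threshold=0.5,
    rollout_processor=SingleTurnRolloutProcessor(),
    num_runs=3,  # repeat each test case to smooth out sampling noise
    mode="pointwise",
)
def test_markdown_highlighting_sampled(row: EvaluationRow) -> EvaluationRow:
    ...  # same counting and scoring logic as shown in Steps 3-5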
Taken together, the dataset and this configuration ensure that the evaluation tests the model’s ability to:
  1. Understand markdown formatting instructions
  2. Apply formatting consistently across different content types
  3. Meet minimum requirements for highlighted sections
  4. Follow specific formatting patterns
This example demonstrates how to create robust, reusable evaluations that can be integrated into CI/CD pipelines, model comparison workflows, and fine-tuning processes.