Let’s walk through creating an evaluation that checks if model responses contain the required number of highlighted sections (like bold or italic text). This example demonstrates the core concepts of writing evaluations with Eval Protocol (EP).
You can find the complete code for this example at test_markdown_highlighting.py.
Understanding the Dataset
Before we start coding, let’s understand what we’re working with. The markdown_dataset.jsonl file contains diverse test cases that evaluate a model’s ability to follow markdown formatting instructions.
Dataset Format
Each entry in the dataset contains:
- key: Unique identifier for the test case
- prompt: The instruction given to the model, which includes:
  - Clear task description
  - Specific markdown formatting requirements
  - Examples of the expected format
  - Minimum number of highlights required
- num_highlights: The ground truth value (number of highlighted sections required)
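To make the three fields concrete, here is a minimal sketch that parses one JSONL line and checks that the documented fields are present. The entry itself is hypothetical: it was written for illustration and is not copied from markdown_dataset.jsonl, and the format of the key value is an assumption. The helper name parse_entry is also made up for this sketch.

```python
import json

# Hypothetical entry written for illustration; it is NOT taken from
# markdown_dataset.jsonl, and the "key" value format is an assumption.
raw_line = (
    '{"key": "example-1", '
    '"prompt": "Write a short poem about autumn. Highlight at least 2 sections '
    'with markdown, e.g. *highlighted section*.", '
    '"num_highlights": 2}'
)

# The three fields described in the dataset format above.
REQUIRED_FIELDS = ("key", "prompt", "num_highlights")


def parse_entry(line: str) -> dict:
    """Parse one JSONL line and check that the documented fields exist."""
    entry = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in entry]
    if missing:
        raise ValueError(f"entry is missing fields: {missing}")
    return entry


entry = parse_entry(raw_line)
print(entry["key"], entry["num_highlights"])
```

Checking the fields up front means a malformed row fails loudly at load time rather than partway through an evaluation run.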
Example Dataset Entries
Creative Writing Tasks:
Dataset Characteristics
Diversity: The dataset covers various content types:
- Creative writing (songs, poems, raps)
- Business documents (proposals, cover letters)
- Educational content (summaries, blog posts)
- Entertainment (riddles, jokes)
Each prompt also specifies:
- The markdown syntax to use (*text* for italic, **text** for bold)
- Minimum number of highlights required
- Examples of proper formatting
- Context for when highlighting should be applied
Step 1: Import Required Dependencies
Now let’s start coding. First, we import the necessary modules from the EP framework:
- re: Python’s regex module for pattern matching
- EvaluateResult: The result object that contains the evaluation score and reasoning
- EvaluationRow: Represents a single evaluation test case with messages and ground truth
- InputMetadata: Optional metadata for input rows (e.g., row_id)
- Message: Represents a message in the conversation
- evaluation_test: Decorator that configures the evaluation test
- SingleTurnRolloutProcessor: Handles the conversation flow for single-turn evaluations
Step 2: Create the Dataset Adapter
We need to create an adapter that converts our dataset format to EP’s expected format. The adapter:
- Takes the raw dataset as a list of dictionaries
- Converts each row to an EvaluationRow with a user message
- Sets the ground truth to the required number of highlights
- Returns the list of evaluation rows
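The adapter steps above can be sketched as follows. To keep the example runnable without EP installed, Message and EvaluationRow here are small stand-in dataclasses that only mimic the fields named in this guide; they are not EP's real classes. The function name and the choice to store the ground truth as a string are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Message:
    """Stand-in for EP's Message (assumed fields: role, content)."""
    role: str
    content: str


@dataclass
class EvaluationRow:
    """Stand-in for EP's EvaluationRow (assumed fields: messages, ground_truth)."""
    messages: List[Message]
    ground_truth: Optional[str] = None


def markdown_dataset_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """Convert raw dataset rows into evaluation rows (hypothetical name)."""
    return [
        EvaluationRow(
            # Each row becomes a single user message holding the prompt.
            messages=[Message(role="user", content=row["prompt"])],
            # Assumption: ground truth is stored as a string, so the
            # evaluator converts it back with int() before comparing.
            ground_truth=str(row["num_highlights"]),
        )
        for row in data
    ]


rows = markdown_dataset_to_evaluation_row(
    [{"key": "example-1", "prompt": "Write a poem.", "num_highlights": 2}]
)
print(rows[0].messages[0].role, rows[0].ground_truth)
```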
Step 3: Define the Evaluation Function
The evaluation function is the core logic that analyzes model responses. To implement this, EP provides the @evaluation_test decorator, which configures the evaluation; its parameters are listed in Step 6. Key points about the function:
- The function receives an EvaluationRow parameter and returns it with the evaluation result attached
- We extract the last message (assistant’s response) using row.messages[-1].content
- We handle edge cases like empty responses
- row.ground_truth contains the required number of highlighted sections
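A minimal sketch of the response-extraction step, assuming stand-in dataclasses in place of EP's real classes so that it runs on its own. The helper name extract_response and the wording of the failure reason are hypothetical; in a real evaluation this logic would sit inside the function wrapped by @evaluation_test.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    """Stand-in for EP's Message."""
    role: str
    content: Optional[str]


@dataclass
class EvaluateResult:
    """Stand-in for EP's EvaluateResult (assumed fields: score, reason)."""
    score: float
    reason: str


@dataclass
class EvaluationRow:
    """Stand-in for EP's EvaluationRow."""
    messages: List[Message]
    ground_truth: Optional[str] = None
    evaluation_result: Optional[EvaluateResult] = None


def extract_response(row: EvaluationRow) -> Optional[str]:
    """Return the assistant's reply, or None after attaching a failing result."""
    # The last message in the conversation is the assistant's response.
    content = row.messages[-1].content if row.messages else None
    if not content or not content.strip():
        # Edge case: an empty or whitespace-only response cannot pass.
        row.evaluation_result = EvaluateResult(score=0.0, reason="Empty response")
        return None
    return content


empty_row = EvaluationRow(messages=[Message("user", "hi"), Message("assistant", "  ")])
full_row = EvaluationRow(messages=[Message("user", "hi"), Message("assistant", "*hello*")])
print(extract_response(empty_row), extract_response(full_row))
```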
Step 4: Implement the Analysis Logic
Next, we implement the core logic to count highlighted sections:
- r"\*[^\n\*]*\*": Matches italic text between single asterisks
  - \*: Literal asterisk
  - [^\n\*]*: Any characters except newlines and asterisks
  - \*: Closing asterisk
- r"\*\*[^\n\*]*\*\*": Matches bold text between double asterisks
- We filter out empty highlights (markers with no text between them) so they do not inflate the count
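The counting logic can be sketched with the two patterns above and nothing beyond the standard library. The function name count_highlights is hypothetical. Note that the single-asterisk pattern also matches the bare ** markers that surround bold text, and the emptiness filter is what keeps those from being counted.

```python
import re

ITALIC_PATTERN = r"\*[^\n\*]*\*"      # text between single asterisks
BOLD_PATTERN = r"\*\*[^\n\*]*\*\*"    # text between double asterisks


def count_highlights(text: str) -> int:
    """Count non-empty *italic* and **bold** sections in text."""
    count = 0
    for match in re.findall(ITALIC_PATTERN, text):
        # Strip the asterisks, then whitespace; a bare "**" becomes "" here
        # and is skipped, so bold delimiters are not counted as italics.
        if match.strip("*").strip():
            count += 1
    for match in re.findall(BOLD_PATTERN, text):
        # Drop the two-character markers on each side before checking.
        if match[2:-2].strip():
            count += 1
    return count


print(count_highlights("This is *italic* and **bold** text."))  # prints 2
```

Because neither pattern allows a newline between the markers, a highlight that spans two lines is not counted.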
Step 5: Generate the Evaluation Result
Finally, we compare the actual count against requirements and attach the result to the row:
- score: 1.0 for success, 0.0 for failure
- reason: Human-readable explanation with emojis for clarity
- The result is attached to row.evaluation_result and the row is returned
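As a rough sketch of this comparison, the function below turns a highlight count and a requirement into a score and a reason. EvaluateResult is a stand-in dataclass rather than EP's class, the name score_highlights is hypothetical, and the exact reason wording is invented for illustration. The >= comparison follows from the requirement being a minimum.

```python
from dataclasses import dataclass


@dataclass
class EvaluateResult:
    """Stand-in for EP's EvaluateResult (assumed fields: score, reason)."""
    score: float
    reason: str


def score_highlights(actual_count: int, required_count: int) -> EvaluateResult:
    """Score 1.0 when the response meets the minimum, otherwise 0.0."""
    if actual_count >= required_count:
        return EvaluateResult(
            score=1.0,
            reason=f"✅ Found {actual_count} highlighted sections (required: {required_count})",
        )
    return EvaluateResult(
        score=0.0,
        reason=f"❌ Only found {actual_count} highlighted sections (required: {required_count})",
    )


# Assumption: ground truth arrives as a string and is converted with int().
result = score_highlights(actual_count=3, required_count=int("2"))
print(result.score, result.reason)
```

In the full evaluation, the returned object would be assigned to row.evaluation_result before the row is returned.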
Step 6: Configuration Parameters
The @evaluation_test decorator configures the evaluation with these parameters:
- input_dataset: Path(s) to the JSONL file(s) containing test cases
- dataset_adapter: Function that converts raw dataset to EvaluationRow objects
- completion_params: Model parameters (e.g., model, temperature, max_tokens)
- passed_threshold: Minimum score required to pass (0.5 = 50% success rate)
- rollout_processor: Processor handling single-turn conversation flow (SingleTurnRolloutProcessor)
- num_runs: Number of times to run each test case
- mode: Evaluation mode (“pointwise” for individual test case evaluation)
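To see why a passed_threshold of 0.5 corresponds to a 50% success rate, consider the arithmetic below. This sketch assumes the overall score is the mean of the per-row scores; that aggregation rule is an assumption made for illustration, not something confirmed here about EP's internals, and the function name is hypothetical.

```python
from typing import Sequence


def meets_threshold(scores: Sequence[float], passed_threshold: float = 0.5) -> bool:
    """Return True when the mean score reaches the threshold.

    Assumption: the aggregate is a plain mean of per-row scores.
    """
    if not scores:
        raise ValueError("at least one score is required")
    mean_score = sum(scores) / len(scores)
    return mean_score >= passed_threshold


# With binary scores, the mean equals the fraction of rows that passed.
print(meets_threshold([1.0, 0.0, 1.0, 0.0]))  # 2 of 4 rows pass, mean 0.5
print(meets_threshold([1.0, 0.0, 0.0, 0.0]))  # 1 of 4 rows pass, mean 0.25
```

Since each row scores either 1.0 or 0.0 in this evaluation, the mean is simply the share of responses that contained enough highlights.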
This evaluation checks whether the model can:
- Understand markdown formatting instructions
- Apply formatting consistently across different content types
- Meet minimum requirements for highlighted sections
- Follow specific formatting patterns

