You can find the complete code for this example at test_markdown_highlighting.py.
Understanding the Dataset
Before we start coding, let’s understand what we’re working with. The markdown_dataset.jsonl file contains diverse test cases that evaluate a model’s ability to follow markdown formatting instructions.
Dataset Format
Each entry in the dataset contains:
- key: Unique identifier for the test case
- prompt: The instruction given to the model, which includes:
  - A clear task description
  - Specific markdown formatting requirements
  - Examples of the expected format
  - The minimum number of highlights required
- num_highlights: The ground truth value (the number of highlighted sections required)
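To make the format concrete, here is a hypothetical JSONL line matching the fields above (the key, prompt text, and values are invented for illustration, not taken from the real dataset):

```python
import json

# A hypothetical dataset entry in the format described above
# (field values are illustrative, not from the actual file).
line = (
    '{"key": "song_highlight_1", '
    '"prompt": "Write a short song about the ocean. Highlight at least 2 '
    'sections with markdown, i.e. *highlighted section*.", '
    '"num_highlights": 2}'
)

entry = json.loads(line)
print(entry["key"], entry["num_highlights"])
```

Each line of the JSONL file is one such standalone JSON object, so the dataset can be read by parsing the file line by line.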
Example Dataset Entries
Creative Writing Tasks:

Dataset Characteristics
Diversity: The dataset covers various content types:
- Creative writing (songs, poems, raps)
- Business documents (proposals, cover letters)
- Educational content (summaries, blog posts)
- Entertainment (riddles, jokes)
Each prompt also specifies:
- The markdown syntax to use (*text* for italic, **text** for bold)
- The minimum number of highlights required
- Examples of proper formatting
- Context for when highlighting should be applied
Step 1: Import Required Dependencies
Now let’s start coding. First, we import the necessary modules from the EP framework:
- re: Python’s regex module for pattern matching
- EvaluateResult: The result object that contains the evaluation score and reasoning
- EvaluationRow: Represents a single evaluation test case with messages and ground truth
- InputMetadata: Optional metadata for input rows (e.g., row_id)
- Message: Represents a message in the conversation
- evaluation_test: Decorator that configures the evaluation test
- SingleTurnRolloutProcessor: Handles the conversation flow for single-turn evaluations
Step 2: Create the Dataset Adapter
We need to create an adapter that converts our dataset format to the EP’s expected format:
- Takes the raw dataset as a list of dictionaries
- Converts each row to an EvaluationRow with a user message
- Sets the ground truth to the required number of highlights
- Returns the list of evaluation rows
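The adapter can be sketched as follows. To keep the sketch self-contained, minimal stand-in dataclasses are used in place of the real EP Message and EvaluationRow classes (in the actual test you would import them from the framework, and ground_truth’s exact type depends on the framework; a string is assumed here):

```python
from dataclasses import dataclass
from typing import List, Optional

# Minimal stand-ins for the EP classes, only to make this sketch runnable.
@dataclass
class Message:
    role: str
    content: str

@dataclass
class EvaluationRow:
    messages: List[Message]
    ground_truth: str
    evaluation_result: Optional[object] = None

def markdown_dataset_to_evaluation_row(data: List[dict]) -> List[EvaluationRow]:
    """Convert raw JSONL rows into EvaluationRow objects with a user message."""
    return [
        EvaluationRow(
            messages=[Message(role="user", content=row["prompt"])],
            ground_truth=str(row["num_highlights"]),
        )
        for row in data
    ]
```

The adapter is deliberately simple: it only repackages fields, leaving all scoring logic to the evaluation function.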
Step 3: Define the Evaluation Function
The evaluation function is the core logic that analyzes model responses. EP provides the @evaluation_test decorator to configure it (its parameters are covered in Step 6). Key points about the function itself:
- The function receives an EvaluationRow parameter and returns it with the evaluation result attached
- We extract the last message (the assistant’s response) using row.messages[-1].content
- We handle edge cases like empty responses
- row.ground_truth contains the required number of highlighted sections
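The extraction and edge-case handling can be sketched like this (the helper name is ours, and row stands for any EvaluationRow-like object):

```python
def extract_response(row) -> str:
    """Return the assistant's reply (the last message's content).

    Returns "" for missing conversations or empty content so the caller
    can score those edge cases as failures instead of crashing.
    """
    if not getattr(row, "messages", None):
        return ""
    return row.messages[-1].content or ""
```

Returning an empty string for every degenerate case keeps the downstream counting logic simple: zero highlights are found, and the row fails cleanly.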
Step 4: Implement the Analysis Logic
Next, we implement the core logic to count highlighted sections:
- r"\*[^\n\*]*\*": Matches italic text between single asterisks
  - \*: Literal asterisk
  - [^\n\*]*: Any characters except newlines and asterisks
  - \*: Closing asterisk
- r"\*\*[^\n\*]*\*\*": Matches bold text between double asterisks
- We filter out empty highlights to ensure quality
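Putting the two patterns together, the counting logic can be sketched as (the function name is ours; note that on bold text the single-asterisk pattern only matches the empty "**" pairs, which the emptiness filter then discards):

```python
import re

def count_highlighted_sections(text: str) -> int:
    """Count non-empty *italic* and **bold** spans in a response."""
    italic = re.findall(r"\*[^\n\*]*\*", text)    # on "**bold**" this yields "**" twice
    bold = re.findall(r"\*\*[^\n\*]*\*\*", text)
    # Filter out empty highlights (e.g. "**" or "****") to ensure quality.
    non_empty = [m for m in italic + bold if m.strip("*").strip()]
    return len(non_empty)
```

For example, "**a** and *b*" counts as 2 highlighted sections, while a bare "****" counts as 0.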
Step 5: Generate the Evaluation Result
Finally, we compare the actual count against requirements and attach the result to the row:
- score: 1.0 for success, 0.0 for failure
- reason: Human-readable explanation with emojis for clarity
- The result is attached to row.evaluation_result and the row is returned
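The comparison step can be sketched as follows. For simplicity this sketch returns a plain (score, reason) tuple, whereas the real evaluation function wraps these values in an EvaluateResult and assigns it to row.evaluation_result:

```python
def score_highlights(actual: int, required: int):
    """Compare the counted highlights against the ground-truth requirement."""
    if actual >= required:
        return 1.0, f"✅ Found {actual} highlighted section(s); {required} required."
    return 0.0, f"❌ Found only {actual} highlighted section(s); {required} required."
```

The score is binary: the model either meets the minimum number of highlights or it does not, with no partial credit for coming close.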
Step 6: Configuration Parameters
The @evaluation_test decorator configures the evaluation with these parameters:
- input_dataset: Path(s) to the JSONL file(s) containing test cases
- dataset_adapter: Function that converts the raw dataset to EvaluationRow objects
- completion_params: Model parameters (e.g., model, temperature, max_tokens)
- passed_threshold: Minimum score required to pass (0.5 = 50% success rate)
- rollout_processor: Processor handling the single-turn conversation flow (SingleTurnRolloutProcessor)
- num_runs: Number of times to run each test case
- mode: Evaluation mode (“pointwise” for individual test case evaluation)
Together, these pieces test whether a model can:
- Understand markdown formatting instructions
- Apply formatting consistently across different content types
- Meet minimum requirements for highlighted sections
- Follow specific formatting patterns

