You can find the complete code for this example at `test_markdown_highlighting.py`.
## Understanding the Dataset
Before we start coding, let’s understand what we’re working with. The `markdown_dataset.jsonl` file contains diverse test cases that evaluate a model’s ability to follow markdown formatting instructions.
### Dataset Format
Each entry in the dataset contains:

- `key`: Unique identifier for the test case
- `prompt`: The instruction given to the model, which includes:
  - Clear task description
  - Specific markdown formatting requirements
  - Examples of the expected format
  - Minimum number of highlights required
- `num_highlights`: The ground truth value (number of highlighted sections required)
### Example Dataset Entries
**Creative Writing Tasks:**
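A creative-writing entry follows the format described above. The values below are purely illustrative (hypothetical key, prompt wording, and highlight count); see the dataset file for the real entries:

```json
{"key": 1001, "prompt": "Write a short song about the ocean. Highlight at least 2 sections of your answer with markdown, i.e. *highlighted section*.", "num_highlights": 2}
```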
### Dataset Characteristics

**Diversity**: The dataset covers various content types:

- Creative writing (songs, poems, raps)
- Business documents (proposals, cover letters)
- Educational content (summaries, blog posts)
- Entertainment (riddles, jokes)
Each prompt specifies:

- The markdown syntax to use (`*text*` for italic, `**text**` for bold)
- Minimum number of highlights required
- Examples of proper formatting
- Context for when highlighting should be applied
## Step 1: Import Required Dependencies
Now let’s start coding. First, we import the necessary modules from the EP framework (an example import block follows the list):

- `re`: Python’s regex module for pattern matching
- `EvaluateResult`: The result object that contains the evaluation score and reasoning
- `EvaluationRow`: Represents a single evaluation test case with messages and ground truth
- `InputMetadata`: Optional metadata for input rows (e.g., `row_id`)
- `Message`: Represents a message in the conversation
- `evaluation_test`: Decorator that configures the evaluation test
- `SingleTurnRolloutProcessor`: Handles the conversation flow for single-turn evaluations
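As a reference, the imports might look like the following sketch. The exact module paths are assumptions and may differ in your EP version; the complete example file is authoritative.

```python
import re  # used later to count highlighted sections

# NOTE: the module paths below are assumed; adjust them to your EP installation.
from eval_protocol.models import (
    EvaluateResult,
    EvaluationRow,
    InputMetadata,
    Message,
)
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
```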
## Step 2: Create the Dataset Adapter
We need to create an adapter that converts our dataset format to EP’s expected format (a sketch follows the list). The adapter:

- Takes the raw dataset as a list of dictionaries
- Converts each row to an `EvaluationRow` with a user message
- Sets the ground truth to the required number of highlights
- Returns the list of evaluation rows
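A minimal sketch of such an adapter is shown below. The adapter name is illustrative, and whether `ground_truth` should be stored as a string or an integer depends on your EP version.

```python
from typing import Any, Dict, List


def markdown_dataset_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """Convert raw JSONL rows into EvaluationRow objects (illustrative sketch)."""
    return [
        EvaluationRow(
            messages=[Message(role="user", content=row["prompt"])],
            ground_truth=str(row["num_highlights"]),
            input_metadata=InputMetadata(row_id=str(row["key"])),
        )
        for row in data
    ]
```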
## Step 3: Define the Evaluation Function
The evaluation function is the core logic that analyzes model responses. EP provides the `@evaluation_test` decorator to configure it (its parameters are covered in Step 6). Within the function itself (a skeleton follows the list):

- The function receives an `EvaluationRow` parameter and returns it with the evaluation result attached
- We extract the last message (the assistant’s response) using `row.messages[-1].content`
- We handle edge cases like empty responses
- `row.ground_truth` contains the required number of highlighted sections
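Putting these points together, the function body might start like the sketch below. The function name and the exact `EvaluateResult` fields are assumptions consistent with the descriptions in this guide; the counting logic is filled in during Steps 4 and 5.

```python
def evaluate_markdown_highlights(row: EvaluationRow) -> EvaluationRow:
    """Sketch of the evaluation function; counting and scoring are added later."""
    content = row.messages[-1].content

    # Edge case: an empty or missing assistant response fails immediately.
    if not content:
        row.evaluation_result = EvaluateResult(score=0.0, reason="❌ Empty response")
        return row

    required = int(row.ground_truth)  # number of highlighted sections required

    # Highlight counting and scoring are implemented in Steps 4 and 5.
    return row
```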
## Step 4: Implement the Analysis Logic
Next, we implement the core logic to count highlighted sections (a sketch follows this list):

- `r"\*[^\n\*]*\*"`: Matches italic text between single asterisks
  - `\*`: Literal asterisk
  - `[^\n\*]*`: Any characters except newlines and asterisks
  - `\*`: Closing asterisk
- `r"\*\*[^\n\*]*\*\*"`: Matches bold text between double asterisks
- We filter out empty highlights to ensure quality
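One way to implement this counting, consistent with the patterns above (the helper name is my own):

```python
import re


def count_highlights(text: str) -> int:
    """Count non-empty *italic* and **bold** sections in a response."""
    single = re.findall(r"\*[^\n\*]*\*", text)      # *italic* spans
    double = re.findall(r"\*\*[^\n\*]*\*\*", text)  # **bold** spans

    # Filter out empty highlights such as "**" or "****".
    count = sum(1 for h in single if h.strip("*").strip())
    count += sum(1 for h in double if h.strip("*").strip())
    return count
```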
## Step 5: Generate the Evaluation Result
Finally, we compare the actual count against the requirement and attach the result to the row (a full sketch follows the list):

- `score`: 1.0 for success, 0.0 for failure
- `reason`: Human-readable explanation with emojis for clarity
- The result is attached to `row.evaluation_result` and the row is returned
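Completing the skeleton from Step 3 with the helper from Step 4, the scoring might look like this. The exact wording of the `reason` strings is illustrative.

```python
def evaluate_markdown_highlights(row: EvaluationRow) -> EvaluationRow:
    """Full evaluation sketch: count highlights and score against the ground truth."""
    content = row.messages[-1].content
    if not content:
        row.evaluation_result = EvaluateResult(score=0.0, reason="❌ Empty response")
        return row

    required = int(row.ground_truth)
    actual = count_highlights(content)

    if actual >= required:
        row.evaluation_result = EvaluateResult(
            score=1.0, reason=f"✅ Found {actual} highlighted sections (needed {required})"
        )
    else:
        row.evaluation_result = EvaluateResult(
            score=0.0, reason=f"❌ Found only {actual} highlighted sections (needed {required})"
        )
    return row
```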
## Step 6: Configuration Parameters
The `@evaluation_test` decorator configures the evaluation with these parameters (an example invocation follows the list):

- `input_dataset`: Path(s) to the JSONL file(s) containing test cases
- `dataset_adapter`: Function that converts the raw dataset to `EvaluationRow` objects
- `completion_params`: Model parameters (e.g., `model`, `temperature`, `max_tokens`)
- `passed_threshold`: Minimum score required to pass (0.5 = 50% success rate)
- `rollout_processor`: Processor handling the single-turn conversation flow (`SingleTurnRolloutProcessor`)
- `num_runs`: Number of times to run each test case
- `mode`: Evaluation mode (`"pointwise"` for individual test case evaluation)
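For reference, the decorator applied to our test might look like the sketch below. The argument shapes (e.g., whether `completion_params` takes a single dict or a list, and whether `rollout_processor` is passed as an instance) can vary between EP versions, the model name is a placeholder, and the function names come from the earlier sketches.

```python
@evaluation_test(
    input_dataset=["markdown_dataset.jsonl"],
    dataset_adapter=markdown_dataset_to_evaluation_row,
    completion_params=[{"model": "your-model-id", "temperature": 0.0, "max_tokens": 1024}],
    passed_threshold=0.5,
    rollout_processor=SingleTurnRolloutProcessor(),
    num_runs=1,
    mode="pointwise",
)
def test_markdown_highlighting(row: EvaluationRow) -> EvaluationRow:
    # Delegate to the evaluation logic sketched in Steps 3-5.
    return evaluate_markdown_highlights(row)
```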
Taken together, these steps test whether a model can:

- Understand markdown formatting instructions
- Apply formatting consistently across different content types
- Meet minimum requirements for highlighted sections
- Follow specific formatting patterns