Create your first static single-turn eval
The `markdown_dataset.jsonl` file contains diverse test cases that evaluate a model’s ability to follow markdown formatting instructions. Each test case has the following fields:

- `key`: Unique identifier for the test case
- `prompt`: The instruction given to the model, which includes the required markdown highlighting syntax (`*text*` for italic, `**text**` for bold)
- `num_highlights`: The ground truth value (number of highlighted sections required)
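For illustration, a single dataset line might look like the following sketch. The field names come from the description above; the values here are hypothetical, not taken from the actual dataset:

```python
import json

# Hypothetical example of one line in markdown_dataset.jsonl
line = (
    '{"key": "1", '
    '"prompt": "Write a short review and highlight at least 2 sections '
    'with markdown, i.e. *highlighted section*.", '
    '"num_highlights": 2}'
)

case = json.loads(line)
print(case["key"], case["num_highlights"])
```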
The evaluation code uses the following imports:

- `re`: Python’s regex module for pattern matching
- `EvaluateResult`: The result object that contains the evaluation score and reasoning
- `EvaluationRow`: Represents a single evaluation test case with messages and ground truth
- `Message`: Represents a message in the conversation
- `evaluation_test`: Decorator that configures the evaluation test
- `default_single_turn_rollout_processor`: Function that handles the conversation flow for single-turn evaluations

The dataset adapter converts each test case into an `EvaluationRow` with a user message, and the test function is decorated with `@evaluation_test`, which configures the evaluation with the parameters described below.
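As a sketch of the adapter step, here is a minimal version using stand-in `Message` and `EvaluationRow` classes (the real types come from the eval framework’s imports listed above, and the adapter name here is hypothetical):

```python
from dataclasses import dataclass
from typing import List, Optional

# Minimal stand-ins for the framework's Message and EvaluationRow types
@dataclass
class Message:
    role: str
    content: str

@dataclass
class EvaluationRow:
    messages: List[Message]
    ground_truth: Optional[str] = None

def markdown_dataset_adapter(raw_rows):
    # Turn each raw test case into an EvaluationRow whose single user
    # message carries the prompt; ground_truth stores num_highlights.
    return [
        EvaluationRow(
            messages=[Message(role="user", content=r["prompt"])],
            ground_truth=str(r["num_highlights"]),
        )
        for r in raw_rows
    ]

rows = markdown_dataset_adapter(
    [{"key": "1", "prompt": "Highlight 2 sections.", "num_highlights": 2}]
)
```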
The evaluation function takes an `EvaluationRow` parameter and returns it with the evaluation result attached. The model’s response is read from `row.messages[-1].content`, and `row.ground_truth` contains the required number of highlighted sections.

Two regex patterns match the highlighted sections:

- `r"\*[^\n\*]*\*"`: Matches italic text between single asterisks
  - `\*`: Literal asterisk
  - `[^\n\*]*`: Any characters except newlines and asterisks
  - `\*`: Closing asterisk
- `r"\*\*[^\n\*]*\*\*"`: Matches bold text between double asterisks

The function returns an `EvaluateResult` with:

- `score`: 1.0 for success, 0.0 for failure
- `reason`: Human-readable explanation with emojis for clarity

The result is stored in `row.evaluation_result` and the row is returned.
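Using those two patterns, the counting and scoring logic can be sketched as a simplified stand-alone version (the real evaluation function additionally wraps the score and reason in an `EvaluateResult`):

```python
import re

def count_highlights(text: str) -> int:
    # Apply both patterns and skip empty shells such as "**", which the
    # single-asterisk pattern also produces when scanning bold text.
    count = 0
    for match in re.findall(r"\*[^\n\*]*\*", text):
        if match.strip("*").strip():
            count += 1
    for match in re.findall(r"\*\*[^\n\*]*\*\*", text):
        if match.strip("*").strip():
            count += 1
    return count

response = "Here is *one* highlighted section and **another** one."
required = 2
score = 1.0 if count_highlights(response) >= required else 0.0
```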
The `@evaluation_test` decorator configures the evaluation with these parameters:
- `input_dataset`: Path to the JSONL file containing the test cases
- `dataset_adapter`: Function that converts the raw dataset to `EvaluationRow` objects
- `model`: The model to evaluate (a Fireworks Kimi model in this case)
- `rollout_input_params`: Model parameters (temperature, max tokens)
- `threshold_of_success`: Minimum score required to pass (0.5 = a 50% success rate)
- `rollout_processor`: Function that handles the conversation flow (`default_single_turn_rollout_processor` for single-turn evaluations)
- `num_runs`: Number of times to run each test case
- `mode`: Evaluation mode ("pointwise" for individual test case evaluation)
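Putting those parameters together, the decorator usage might look roughly like this sketch. The adapter name, model identifier, and the concrete values for `rollout_input_params` and `num_runs` are illustrative assumptions, not taken from the actual tutorial code:

```python
@evaluation_test(
    input_dataset="markdown_dataset.jsonl",
    dataset_adapter=markdown_dataset_adapter,  # hypothetical adapter name
    model="<your-fireworks-kimi-model-id>",    # placeholder model id
    rollout_input_params={"temperature": 0.0, "max_tokens": 1024},  # illustrative
    threshold_of_success=0.5,
    rollout_processor=default_single_turn_rollout_processor,
    num_runs=1,                                # illustrative
    mode="pointwise",
)
def test_markdown_highlights(row):
    ...
```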