Evaluate mathematical reasoning with the GSM8K dataset using a structured thinking format: the model wraps its reasoning in `<think>...</think>` tags and its final result in `<answer>...</answer>` tags.
Each test case in the dataset has three fields (an example row is shown below):

- `id`: Unique identifier for the test case
- `user_query`: The math word problem to solve
- `ground_truth_for_eval`: The expected solution, with step-by-step reasoning in GSM8K's `<<calculation=result>>` format and the final answer after a `####` marker
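For illustration, a single row might look like the sketch below. The problem and solution follow the standard GSM8K conventions, but the concrete values and the exact layout of the sample dataset file are illustrative assumptions:

```python
# Hypothetical GSM8K sample row (field names from the list above; values illustrative).
sample_row = {
    "id": "gsm8k_0001",
    "user_query": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell altogether "
        "in April and May?"
    ),
    "ground_truth_for_eval": (
        "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
        "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
        "#### 72"
    ),
}
```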
The example relies on the following imports (a sketch of the import block follows the list):

- `re`: Python's regex module for pattern matching
- `typing`: Python's typing module for type hints (`Any`, `Dict`, `List`)
- `EvaluateResult`: The result object containing the evaluation score and reasoning
- `EvaluationRow`: The data structure containing conversation messages and ground truth
- `MetricResult`: Individual metric results for detailed analysis
- `default_single_turn_rollout_processor`: Default processor for single-turn conversations
- `evaluation_test`: Decorator for configuring evaluation tests
- `math_reward`: Built-in math evaluation function
- `check_think_answer_format`: Function to validate the structured thinking format
- `gsm8k_to_evaluation_row`: Adapter function that converts a GSM8K record into an `EvaluationRow` with a user message containing the math problem
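Put together, the import block likely looks something like the following. Only `eval_protocol.rewards.math` (for `math_reward`) is confirmed later on this page; the other module paths are assumptions, and some helpers (such as the adapter) may instead be defined locally in the test file:

```python
import re
from typing import Any, Dict, List

# NOTE: module paths below are assumed; only eval_protocol.rewards.math is
# confirmed elsewhere on this page.
from eval_protocol import (
    EvaluateResult,
    EvaluationRow,
    MetricResult,
    evaluation_test,
    default_single_turn_rollout_processor,
    check_think_answer_format,
    gsm8k_to_evaluation_row,
)
from eval_protocol.rewards.math import math_reward
```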
The format-validation pattern is built from these pieces (see the short demo after this list):

- `<think>[\s\S]*?</think>`: Matches the thinking section, including any characters and newlines
- `[\s\S]*?`: Matches any characters (including newlines) between the think and answer tags
- `<answer>[\s\S]*?</answer>`: Matches the answer section
- `re.search()`: Searches for the pattern anywhere in the text (it does not require the pattern to span the entire text)

Together, these check that the `<think>` and `<answer>` sections appear in the correct order.
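The `[\s\S]` character class is used instead of `.` because `.` does not match newlines unless `re.DOTALL` is set. A quick, self-contained demonstration using the pattern quoted later on this page:

```python
import re

PATTERN = r"<think>[\s\S]*?</think>[\s\S]*?<answer>[\s\S]*?</answer>"

response = "<think>48 / 2 = 24\n48 + 24 = 72</think>\n<answer>72</answer>"

# [\s\S] matches the embedded newlines, so the search succeeds.
print(bool(re.search(PATTERN, response)))  # True

# With '.' and no re.DOTALL, the newlines break the match.
print(bool(re.search(r"<think>.*?</think>.*?<answer>.*?</answer>", response)))  # False
```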
The example uses the `@evaluation_test` decorator to configure the evaluation. The evaluation function combines numerical accuracy with format validation:

- `math_reward` function to check if the final answer matches the ground truth (80% weight)

The decorator is configured with the following parameters (a sketch of the configured test follows the list):

- `input_dataset`: Path to the GSM8K sample dataset
- `dataset_adapter`: Function that converts GSM8K records into `EvaluationRow` objects
- `model`: The model to evaluate (a Fireworks Kimi model in this case)
- `rollout_input_params`: Model parameters (temperature set to 0.0 for deterministic results)
- `max_dataset_rows`: Limit to 5 test cases for a quick evaluation
- `threshold_of_success`: Set to 0.0 to see all results (can be adjusted based on requirements)
- `rollout_processor`: Uses the default single-turn processor for math problems
- `mode`: `pointwise`, for evaluating individual rows, since each row can be evaluated independently
- `evaluation_test_kwargs`: Additional parameters for the evaluation function
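A rough sketch of how these pieces might be wired together is shown below. The decorator parameter names come from the list above, but the exact signatures, the dataset path, the model identifier, the `EvaluationRow`/`EvaluateResult` field names, and the 20% weight given to format validation (the page only states the 80% accuracy weight) are all assumptions:

```python
# Sketch only: paths, model id, field names, and the 0.2 format weight are assumptions.
@evaluation_test(
    input_dataset=["tests/data/gsm8k_sample.jsonl"],       # hypothetical path
    dataset_adapter=gsm8k_to_evaluation_row,
    model="accounts/fireworks/models/<kimi-model>",         # placeholder model id
    rollout_input_params={"temperature": 0.0},              # deterministic sampling
    max_dataset_rows=5,                                     # quick evaluation on 5 rows
    threshold_of_success=0.0,                               # report all results
    rollout_processor=default_single_turn_rollout_processor,
    mode="pointwise",
    evaluation_test_kwargs={},                              # extra kwargs for the evaluation function
)
def test_gsm8k_math_reasoning(row: EvaluationRow) -> EvaluateResult:
    # NOTE: attribute names on EvaluationRow / EvaluateResult are assumed here.
    assistant_text = row.messages[-1].content or ""

    accuracy = math_reward(messages=row.messages, ground_truth=row.ground_truth).score
    format_ok = 1.0 if check_think_answer_format(assistant_text) else 0.0

    score = 0.8 * accuracy + 0.2 * format_ok  # the 0.2 format weight is an assumption
    return EvaluateResult(
        score=score,
        reason=f"accuracy={accuracy}, format_ok={format_ok}",
    )
```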
### math_reward Function

The `math_reward` function is a built-in evaluation function that extracts numerical answers from text and compares them with expected values. It is located in `eval_protocol.rewards.math`.

Key Features:

- Configurable relative and absolute tolerance via the `tolerance` and `absolute_tolerance` parameters
- Optional unit checking via the `require_units` parameter

Parameters:

- `messages`: List of conversation messages (the answer is extracted from the last assistant message)
- `ground_truth`: Expected answer string containing the correct numerical value
- `tolerance`: Relative tolerance for floating-point comparisons (default: 0.001)
- `absolute_tolerance`: Absolute tolerance for very small numbers (default: 1e-8)
- `require_units`: Whether to require units to match (default: False)

Returns: `EvaluateResult` with a score (1.0 for correct, 0.0 for incorrect) and detailed reasoning. A usage sketch follows.
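As an illustration, a call might look like the sketch below. The parameter names and defaults come from the list above; the `Message` constructor arguments and its import location are assumptions:

```python
# Hypothetical usage; Message constructor fields are assumed.
messages = [
    Message(role="user", content="Natalia sold clips to 48 friends in April and half as many in May. How many altogether?"),
    Message(role="assistant", content="<think>48/2 = 24; 48 + 24 = 72</think><answer>72</answer>"),
]

result = math_reward(
    messages=messages,
    ground_truth="72",        # expected answer string
    tolerance=0.001,          # relative tolerance (default)
    absolute_tolerance=1e-8,  # absolute tolerance for very small numbers (default)
    require_units=False,      # units are not required to match (default)
)
print(result.score)  # 1.0 when the extracted answer matches the ground truth
```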
### check_think_answer_format Function

The `check_think_answer_format` function validates that a response contains `<think>` and `<answer>` tags.

Function Signature:

```python
r"<think>[\s\S]*?</think>[\s\S]*?<answer>[\s\S]*?</answer>"
```

The pattern breaks down as follows:

- `<think>[\s\S]*?</think>`: Matches the thinking section with any content
- `[\s\S]*?`: Matches any characters (including newlines) between sections
- `<answer>[\s\S]*?</answer>`: Matches the answer section with any content

The function returns `True` if both sections are present in the correct order, and `False` otherwise. Examples of invalid formats (see the usage demo after this list):

- Missing `<think>` section: `<answer>18</answer>`
- Missing `<answer>` section: `<think>Step by step reasoning...</think>`
- Wrong order: `<answer>18</answer><think>reasoning...</think>`
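A quick sketch of how the check behaves on those examples, assuming it takes the response text as a plain string:

```python
# Assumed signature: check_think_answer_format(text: str) -> bool
valid = "<think>16 - 3 - 4 = 9 eggs; 9 * 2 = 18</think><answer>18</answer>"

print(check_think_answer_format(valid))                                              # True
print(check_think_answer_format("<answer>18</answer>"))                              # False: missing <think>
print(check_think_answer_format("<think>Step by step reasoning...</think>"))         # False: missing <answer>
print(check_think_answer_format("<answer>18</answer><think>reasoning...</think>"))   # False: wrong order
```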
### gsm8k_to_evaluation_row Function

The `gsm8k_to_evaluation_row` function converts GSM8K records into the `EvaluationRow` format.

Function Signature:

The adapter does the following (a hedged sketch is shown below):

- Takes the `user_query` and creates a `Message` with role "user"
- Uses `ground_truth_for_eval` as the ground truth for comparison
- Returns `EvaluationRow` objects that the EP framework can process
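A minimal sketch of what such an adapter might look like. Whether it takes one record or a list, and the exact `EvaluationRow`/`Message` constructor fields, are assumptions not shown on this page:

```python
# Sketch only: EvaluationRow / Message field names are assumptions.
def gsm8k_to_evaluation_row(records: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """Convert raw GSM8K records into EvaluationRow objects."""
    return [
        EvaluationRow(
            messages=[Message(role="user", content=record["user_query"])],
            ground_truth=record["ground_truth_for_eval"],
        )
        for record in records
    ]
```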
The expected model response has two parts (an example is shown below):

- `<think>` section: Detailed step-by-step reasoning
- `<answer>` section: Clear final answer
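For example, a well-formed response to a GSM8K problem might look like this (content illustrative):

```text
<think>
The ducks lay 16 eggs per day.
She eats 3 and uses 4 for baking: 16 - 3 - 4 = 9 eggs left.
She sells them at $2 each: 9 * 2 = 18.
</think>
<answer>18</answer>
```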