Evaluate code correctness by executing Python functions and comparing outputs
Each row in the dataset provides:

- `prompt`: The coding task description specifying what function to write
- `input`: Test input value to pass to the function
- `expected_output`: The correct output the function should return

The evaluation imports:

- `typing`: Python’s typing module for type hints (Any, Dict, List)
- `EvaluateResult`: Result object containing evaluation score and reasoning
- `EvaluationRow`: Data structure containing conversation messages and ground truth
- `Message`: Individual message in the conversation
- `default_single_turn_rollout_processor`: Default processor for single-turn conversations
- `evaluation_test`: Decorator for configuring evaluation tests
- `extract_code_blocks`: Function to extract Python code from markdown code blocks
- `execute_python_code`: Function to safely execute Python code and capture output

The `@evaluation_test` decorator configures the evaluation:
- `input_dataset`: Path to the basic coding dataset JSONL file
- `model`: The model to evaluate (a Fireworks Kimi model in this case)
- `rollout_input_params`: Model parameters, including temperature=0.0 for deterministic results
- `threshold_of_success`: 80% success rate threshold for the evaluation
- `mode`: `pointwise`, for evaluating individual rows independently
- `dataset_adapter`: Function that converts the coding format to EvaluationRow objects
- `rollout_processor`: Uses the default single-turn processor for coding evaluations

The evaluator uses `extract_code_blocks` to find Python code in markdown blocks.
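As a rough, self-contained sketch of this flow (extract the code, run it on the test `input`, compare with `expected_output`), using a hypothetical `score_row` helper and an inline regex rather than the library's actual implementations:

```python
import re

def score_row(response_text: str, test_input, expected_output) -> float:
    """Hypothetical stand-in for the per-row check: extract code, run it,
    and compare the function's result with the expected output."""
    # Pull the first ```python fenced block out of the model's response.
    match = re.search(r"```python\n(.*?)```", response_text, re.DOTALL)
    if not match:
        return 0.0  # no code found, so the row fails
    namespace: dict = {}
    exec(match.group(1), namespace)  # NOTE: no sandboxing in this sketch
    # Call the first function the code defined with the test input.
    func = next(v for k, v in namespace.items()
                if callable(v) and not k.startswith("__"))
    return 1.0 if func(test_input) == expected_output else 0.0

response = "Here you go:\n```python\ndef double(x):\n    return x * 2\n```"
print(score_row(response, 5, 10))  # -> 1.0
```

The real evaluation delegates extraction and execution to the library helpers described below instead of calling `exec` directly.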
extract_code_blocks Function

The `extract_code_blocks` function identifies and extracts Python code from the model’s response, typically from markdown code blocks.
Key Features:

- `text`: The assistant’s response containing code
- `language`: Optional language filter (e.g., “python”)
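A minimal re-implementation sketch of that behavior (the actual helper's regex and return type may differ; this version is an assumption for illustration only):

```python
import re
from typing import List, Optional

def extract_code_blocks(text: str, language: Optional[str] = None) -> List[str]:
    """Return the bodies of fenced code blocks, optionally filtered by language."""
    # Match fenced blocks like ```python ... ```; capture the info string and body.
    blocks = re.findall(r"```(\w*)\n(.*?)```", text, re.DOTALL)
    # Keep every block, or only those whose language tag matches the filter.
    return [body for lang, body in blocks if language is None or lang == language]

reply = 'Sure:\n```python\nprint("hi")\n```\nAnd shell:\n```bash\necho hi\n```'
print(extract_code_blocks(reply, language="python"))  # -> ['print("hi")\n']
```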
execute_python_code Function

The `execute_python_code` function safely executes Python code in a controlled environment with security restrictions and resource limits.
Key Features:

- `code`: Python code to execute
- `timeout`: Maximum execution time in seconds
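A simplified sketch of such an executor, running the code in a fresh interpreter process with a timeout (the real helper applies stronger security restrictions than shown here):

```python
import subprocess
import sys

def execute_python_code(code: str, timeout: int = 10) -> str:
    """Run `code` in a separate Python process and capture its stdout.
    Sketch only: a timeout is the sole resource limit enforced here."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    if result.returncode != 0:
        # Surface the traceback from the child process as an error.
        raise RuntimeError(result.stderr.strip())
    return result.stdout

print(execute_python_code("print(2 + 3)"))  # -> 5
```

Running in a subprocess keeps the model's code out of the evaluator's own interpreter, so a crash or infinite loop in generated code cannot take down the evaluation itself.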