This example demonstrates how to create comprehensive basic coding evaluations using the Eval Protocol (EP) framework. The evaluation uses code execution functions to test whether models can write correct Python functions that produce expected outputs when executed with specific inputs.
You can find the complete code for this example at test_basic_coding.py.

Understanding Basic Coding Evaluation

Basic coding evaluation assesses a model’s ability to:
  • Write syntactically correct code: Generate valid Python syntax without errors
  • Implement correct logic: Create functions that perform the specified operations
  • Handle different inputs: Process various input values correctly (positive, negative, zero, edge cases)
  • Produce exact outputs: Return results that match expected values precisely
Unlike text-based evaluations that focus on natural language generation, coding evaluations test a model’s programming capabilities and logical reasoning, skills that are essential for AI systems expected to write functional code.

Understanding the Dataset Structure

The basic coding dataset contains simple programming tasks that evaluate fundamental coding skills, from arithmetic operations to data structure manipulation.

Dataset Format

Each entry in the dataset contains:
  • prompt: The coding task description specifying what function to write
  • input: Test input value to pass to the function
  • expected_output: The correct output the function should return

Example Dataset Entries

Simple Addition Function:
{
  "prompt": "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1.",
  "input": "5",
  "expected_output": "6"
}
Multiplication Function:
{
  "prompt": "Write a Python function `multiply_by_two` that takes an integer and returns the integer multiplied by 2.",
  "input": "3",
  "expected_output": "6"
}
List Operations:
{
  "prompt": "Write a Python function `get_length` that takes a list and returns its length.",
  "input": "[1, 2, 3]",
  "expected_output": "3"
}
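Because the dataset is stored as a JSONL file (one JSON object per line), entries can be inspected with Python’s standard json module. The snippet below is a minimal sketch that reads the dataset path used in Step 3; adjust the path to match your local layout:
import json

with open("tests/pytest/data/basic_coding_dataset.jsonl") as f:
    entries = [json.loads(line) for line in f if line.strip()]

print(entries[0]["prompt"])           # coding task description
print(entries[0]["input"])            # test input, e.g. "5"
print(entries[0]["expected_output"])  # expected result, e.g. "6"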

Step 1: Import Required Dependencies

First, we import the necessary modules from the EP framework:
from typing import Any, Dict, List
from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
from eval_protocol.rewards.code_execution import extract_code_blocks, execute_python_code
  • typing: Python’s typing module for type hints (Any, Dict, List)
  • EvaluateResult: Result object containing evaluation score and reasoning
  • EvaluationRow: Data structure containing conversation messages and ground truth
  • Message: Individual message in the conversation
  • SingleTurnRolloutProcessor: Rollout processor for single-turn conversations (one user prompt, one model response)
  • evaluation_test: Decorator for configuring evaluation tests
  • extract_code_blocks: Function to extract Python code from markdown code blocks
  • execute_python_code: Function to safely execute Python code and capture output

Step 2: Create the Dataset Adapter

We need to convert the basic coding dataset format to the EP’s expected format:
def coding_dataset_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """
    Convert entries from coding dataset to EvaluationRow objects.
    
    This adapter combines the coding prompt with the test input to create
    a complete user message, and stores the expected output as ground truth
    for comparison during evaluation.
    
    Args:
        data: List of coding dataset entries with prompt, input, and expected_output
        
    Returns:
        List of EvaluationRow objects ready for evaluation
    """
    return [
        EvaluationRow(
            messages=[Message(role="user", content=f"{row['prompt']} Input: {row['input']}")], 
            ground_truth=row["expected_output"]
        )
        for row in data
    ]
This adapter:
  • Combines the coding prompt with the test input into a single user message
  • Stores the expected output as ground truth for comparison
  • Creates Message objects with the proper role and content structure
  • Returns a list of EvaluationRow objects that the framework can process
Key transformations:
  • Message construction: Combines prompt and input into clear instructions
  • Ground truth preservation: Maintains expected output for exact comparison
  • Role assignment: Sets proper user role for the coding request
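For example, applying the adapter to a single in-memory entry (the addition task from above) yields an EvaluationRow whose user message combines the prompt and input, with the expected output stored as ground truth. This is a small illustrative sketch, not part of the test file:
sample = [{
    "prompt": "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1.",
    "input": "5",
    "expected_output": "6",
}]

rows = coding_dataset_to_evaluation_row(sample)
print(rows[0].messages[0].content)  # "...returns the integer incremented by 1. Input: 5"
print(rows[0].ground_truth)         # "6"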

Step 3: Configure and Run the Evaluation

We use the @evaluation_test decorator to configure the evaluation:
@evaluation_test(
    input_dataset=["tests/pytest/data/basic_coding_dataset.jsonl"],
    dataset_adapter=coding_dataset_to_evaluation_row,
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0, "max_tokens": 4096}],
    passed_threshold=0.8,
    rollout_processor=SingleTurnRolloutProcessor(),
    num_runs=1,
    mode="pointwise",
)
async def test_coding_code_evaluation(row: EvaluationRow) -> EvaluationRow:
    """
    Evaluation function that tests code correctness by executing it locally.
    
    This function:
    1. Extracts Python code from the assistant's response
    2. Executes the code locally with timeout=10
    3. Compares the output to ground_truth
    4. Returns a score of 1.0 if output matches, 0.0 otherwise
    
    Args:
        row: EvaluationRow containing the conversation messages and expected_output in ground_truth
        
    Returns:
        EvaluationRow with the evaluation result
    """
    # Check if we have an assistant response
    if len(row.messages) < 2 or row.messages[-1].role != "assistant":
        row.evaluation_result = EvaluateResult(score=0.0, reason="No assistant response found")
        return row
    
    assistant_content = row.messages[-1].content or ""
    expected_output = (row.ground_truth or "").strip()
    
    # Extract Python code blocks
    code_blocks = extract_code_blocks(assistant_content, language="python")
    if not code_blocks:
        row.evaluation_result = EvaluateResult(score=0.0, reason="No Python code block found")
        return row
    
    code = code_blocks[0]["code"]
    
    # Execute the code locally
    execution_result = execute_python_code(code, timeout=10)
    
    if not execution_result.get("success", False):
        error_msg = execution_result.get("error", "Code execution failed")
        row.evaluation_result = EvaluateResult(score=0.0, reason=f"Execution error: {error_msg}")
        return row
    
    # Compare output with expected
    actual_output = (execution_result.get("output", "") or "").strip()
    
    if actual_output == expected_output:
        row.evaluation_result = EvaluateResult(
            score=1.0, 
            reason=f"✅ Output matches: '{actual_output}'"
        )
    else:
        row.evaluation_result = EvaluateResult(
            score=0.0, 
            reason=f"❌ Expected: '{expected_output}', Got: '{actual_output}'"
        )
    
    return row
Configuration parameters:
  • input_dataset: Path to the basic coding dataset JSONL file
  • dataset_adapter: Function that converts the coding dataset format into EvaluationRow objects
  • completion_params: Model and sampling parameters, here the Fireworks Kimi model with temperature=0.0 for deterministic results and max_tokens=4096
  • passed_threshold: Success-rate threshold (0.8, i.e. 80%) the evaluation must reach to pass
  • rollout_processor: SingleTurnRolloutProcessor, the single-turn processor used for these coding prompts
  • num_runs: Number of evaluation runs per row (1 here)
  • mode: pointwise, evaluating each row independently
Evaluation process:
  1. Validate response: Ensure we have a valid assistant response containing code
  2. Extract code: Use extract_code_blocks to find Python code in markdown blocks
  3. Execute safely: Run the code in a secure environment with timeout protection
  4. Compare output: Perform exact string comparison between actual and expected results
  5. Return score: Provide binary score (1.0 for exact match, 0.0 for any difference)
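These steps can also be exercised outside the pytest runner. The sketch below builds a synthetic conversation and awaits the evaluation function directly; it assumes the @evaluation_test decorator still lets the underlying function be awaited with a prepared EvaluationRow, which may not hold for every version of the framework:
import asyncio

# Hypothetical direct invocation, for illustration only.
row = EvaluationRow(
    messages=[
        Message(role="user", content="Write a Python function `add_one` that takes an integer and returns the integer incremented by 1. Input: 5"),
        Message(role="assistant", content="```python\ndef add_one(x):\n    return x + 1\n\nprint(add_one(5))\n```"),
    ],
    ground_truth="6",
)

scored = asyncio.run(test_coding_code_evaluation(row))
print(scored.evaluation_result.score)   # 1.0 if the executed code prints "6"
print(scored.evaluation_result.reason)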

Core Functions Explained

extract_code_blocks Function

The extract_code_blocks function identifies and extracts Python code from the model’s response, typically from markdown code blocks.
Key Features:
  • Markdown parsing: Identifies ```python code blocks in responses
  • Language filtering: Can filter for specific programming languages
  • Content cleaning: Removes verbose explanatory text that might interfere with execution
  • Multiple blocks: Can extract multiple code blocks if present
Function Signature:
def extract_code_blocks(text: str, language: Optional[str] = None) -> List[Dict[str, str]]:
Parameters:
  • text: The assistant’s response containing code
  • language: Optional language filter (e.g., “python”)
Return Value:
  • List of dictionaries with “code” and “language” keys
Example Usage:
response = """
Here's the solution:

```python
def add_one(x):
    return x + 1
```

This function takes an integer and returns it incremented by 1.
"""

code_blocks = extract_code_blocks(response, language="python")
print(code_blocks[0]["code"])  # "def add_one(x):\n    return x + 1"

execute_python_code Function

The execute_python_code function safely executes Python code in a controlled environment with security restrictions and resource limits.
Key Features:
  • Secure execution: Runs code in a subprocess with memory and time limits
  • Safety guards: Disables dangerous operations like file system access
  • Timeout protection: Prevents infinite loops and long-running code
  • Error handling: Captures and reports execution errors clearly
  • Output capture: Returns both stdout and stderr from execution
Function Signature:
def execute_python_code(code: str, timeout: int = 5) -> Dict[str, Any]:
Parameters:
  • code: Python code to execute
  • timeout: Maximum execution time in seconds
Return Value:
  • Dictionary with execution results including success status, output, and errors
Example Usage:
code = """
def add_one(x):
    return x + 1

result = add_one(5)
print(result)
"""

result = execute_python_code(code, timeout=10)
if result["success"]:
    print(f"Output: {result['output']}")  # "Output: 6"
else:
    print(f"Error: {result['error']}")

Security and Safety Features

The code execution environment includes several safety measures.
Resource Limits:
  • Memory limits: Restricts memory usage to prevent excessive consumption
  • CPU limits: Prevents long-running computations
  • Timeout enforcement: Kills processes that exceed time limits
Disabled Operations:
  • File system access: Prevents reading/writing files
  • Network operations: Blocks network requests
  • System calls: Disables potentially dangerous system operations
  • Process spawning: Prevents creating new processes
Error Handling:
  • Exception capture: Catches and reports Python exceptions
  • Timeout detection: Identifies and reports timeout errors
  • Resource exhaustion: Handles memory and CPU limit violations
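As a quick illustration of the timeout guard, the sketch below (assuming the success/error return shape described above) runs an infinite loop under a short limit and reports the failure instead of hanging:
looping_code = """
while True:
    pass  # never terminates on its own
"""

result = execute_python_code(looping_code, timeout=2)
if not result.get("success", False):
    print(f"Blocked by the sandbox: {result.get('error', 'timed out')}")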

Evaluation Scenarios and Results

The basic coding evaluation handles various scenarios with different outcomes:

Perfect Implementation (Score: 1.0)

Scenario: Model writes correct function that produces expected output
# User prompt: "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1. Input: 5"
# Model response:
def add_one(x):
    return x + 1

result = add_one(5)
print(result)
Result: ✅ Output matches: ‘6’ - Function correctly implements the required logic

Syntax Error (Score: 0.0)

Scenario: Model writes code with syntax errors
# User prompt: "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1. Input: 5"
# Model response:
def add_one(x)  # Missing colon
    return x + 1

result = add_one(5)
print(result)
Result: ❌ Execution error: SyntaxError - Invalid Python syntax prevents execution

Logic Error (Score: 0.0)

Scenario: Model writes syntactically correct but logically incorrect code
# User prompt: "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1. Input: 5"
# Model response:
def add_one(x):
    return x + 2  # Wrong logic: adds 2 instead of 1

result = add_one(5)
print(result)
Result: ❌ Expected: ‘6’, Got: ‘7’ - Logic error produces wrong output

Missing Function Call (Score: 0.0)

Scenario: Model defines function but doesn’t call it with the input
# User prompt: "Write a Python function `add_one` that takes an integer and returns the integer incremented by 1. Input: 5"
# Model response:
def add_one(x):
    return x + 1
# Missing: result = add_one(5)
# Missing: print(result)
Result: ❌ Expected: ‘6’, Got: ‘’ - No output produced because the function is never called or printed

Runtime Error (Score: 0.0)

Scenario: Model writes code that fails during execution
# User prompt: "Write a Python function `get_length` that takes a list and returns its length. Input: [1, 2, 3]"
# Model response:
def get_length(lst):
    return lst.length()  # Wrong method: should use len()

result = get_length([1, 2, 3])
print(result)
Result: ❌ Execution error: AttributeError - Runtime error during function call

Edge Case Handling (Score: 1.0)

Scenario: Model correctly handles edge cases like empty lists or zero values
# User prompt: "Write a Python function `get_length` that takes a list and returns its length. Input: []"
# Model response:
def get_length(lst):
    return len(lst)

result = get_length([])
print(result)
Result: ✅ Output matches: ‘0’ - Correctly handles empty list edge case

Conclusion

This basic coding evaluation demonstrates how to assess AI models’ programming capabilities using code execution and output comparison. The evaluation ensures models can write syntactically correct code, implement proper logic, handle various inputs, and produce exact expected outputs. This evaluation is particularly valuable for:
  • AI model assessment: Evaluating language models’ programming capabilities
  • Code generation tools: Validating the correctness of automatically generated code
  • Algorithm testing: Ensuring implementations produce correct results
The basic coding evaluation focuses on functional correctness rather than code style or efficiency, making it a solid foundation for building reliable AI systems that can write working code. It provides objective scoring with secure execution, immediate feedback, and scalable automated testing.