This example demonstrates how to create comprehensive competitive programming evaluations using the APPS (Automated Programming Progress Standard) dataset from CodeParrot. The evaluation tests AI models’ ability to solve complex algorithmic challenges similar to those found in competitive programming contests.
You can find the complete code for this example at test_apps_coding.py.

Understanding APPS Coding Evaluation

APPS coding evaluation assesses a model’s ability to:
  • Solve complex algorithmic problems: Handle competitive programming challenges with multiple constraints
  • Implement sophisticated logic: Design algorithms for graph theory, dynamic programming, and data structures
  • Handle multiple test cases: Pass comprehensive test suites with edge cases and boundary conditions
  • Work with competitive formats: Process standard input/output formats used in programming contests
Unlike basic coding tasks that test simple function implementation, APPS evaluation tests advanced algorithmic thinking and competitive programming skills, which are essential for building AI systems capable of complex problem-solving.

Understanding the APPS Dataset Structure

The APPS dataset from CodeParrot contains 10,000 competitive programming problems sourced from platforms like Codeforces, AtCoder, Kattis, and Codewars, providing realistic algorithmic challenges at three difficulty levels.

Dataset Format

Each entry in the APPS dataset contains:
  • problem_id: Unique identifier for the problem
  • question: Detailed problem description with constraints, examples, and input/output format
  • solutions: Array of reference Python solutions that correctly solve the problem
  • input_output: JSON containing comprehensive test cases with inputs and expected outputs
  • difficulty: Classification as “introductory”, “interview”, or “competition”
  • url: Source URL of the original problem from competitive programming platforms
  • starter_code: Optional template code to begin implementation

Example APPS Dataset Entry

Competitive Programming Problem:
{
  "id": 1,
  "question": "Mikhail walks on a Cartesian plane. He starts at the point $(0, 0)$, and in one move he can go to any of eight adjacent points. For example, if Mikhail is currently at the point $(0, 0)$, he can go to any of the following points in one move: $(1, 0)$; $(1, 1)$; $(0, 1)$; $(-1, 1)$; $(-1, 0)$; $(-1, -1)$; $(0, -1)$; $(1, -1)$.\n\nIf Mikhail goes from the point $(x1, y1)$ to the point $(x2, y2)$ in one move, and $x1 \ne x2$ and $y1 \ne y2$, then such a move is called a diagonal move.\n\nMikhail has $q$ queries. For the $i$-th query Mikhail's target is to go to the point $(n_i, m_i)$ from the point $(0, 0)$ in exactly $k_i$ moves...",
  "solutions": [
    "q=int(input())\n\nfor e in range(q):\n    x,y,k=list(map(int,input().split()))\n    x,y=abs(x),abs(y)\n    x,y=max(x,y),min(x,y)\n    # ... complete solution"
  ],
  "input_output": {
    "inputs": [
      "3
      2 2 3
      4 3 7
      10 1 9"
    ],
    "outputs": [
      "1
      6
      -1"
    ]
  },
  "difficulty": "interview",
  "url": "https://codeforces.com/problemset/problem/1036/B",
  "starter_code": ""
}
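The input_output field is itself a JSON payload. A small illustrative snippet (standard library only) that parses it and pairs each stdin payload with its expected stdout for the problem above:
import json

# The test cases for the Mikhail problem shown above, stored as a JSON string
input_output = '{"inputs": ["3\\n2 2 3\\n4 3 7\\n10 1 9\\n"], "outputs": ["1\\n6\\n-1\\n"]}'

tests = json.loads(input_output)
for stdin_text, expected_stdout in zip(tests["inputs"], tests["outputs"]):
    print(repr(stdin_text), "->", repr(expected_stdout))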

Dataset Characteristics

Problem Complexity: APPS problems feature advanced algorithmic concepts:
  • Graph algorithms: Shortest paths, minimum spanning trees, graph traversal
  • Dynamic programming: Optimization problems with overlapping subproblems
  • Data structures: Advanced usage of heaps, trees, and custom data structures
  • Mathematical algorithms: Number theory, combinatorics, and geometric problems
  • String algorithms: Pattern matching, string manipulation, and parsing
Difficulty Progression:
  • Introductory (2,889 problems): Basic algorithmic concepts and simple implementations
  • Interview (3,592 problems): Common coding interview problems with moderate complexity
  • Competition (572 problems): Advanced competitive programming challenges
Test Coverage: Comprehensive testing ensures robust evaluation:
  • Multiple test cases: Average of 21.2 test cases per problem
  • Edge cases: Boundary conditions and corner cases included
  • Performance constraints: Problems include time and memory limits
  • Real contest data: Authentic test cases from actual programming competitions
Sample Dataset: The EP python-sdk includes a sample APPS dataset with just 3 problems for testing and demonstration purposes. The full CodeParrot APPS dataset contains 10,000 problems across all difficulty levels.
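To get a feel for the sample file before wiring it into an evaluation, you can inspect it with a few lines of standard-library Python. The path below matches the input_dataset used in Step 3; the field names follow the dataset format described above and may need adjusting if the sample file differs:
import json

with open("tests/pytest/data/apps_sample_dataset.jsonl") as f:
    rows = [json.loads(line) for line in f if line.strip()]

for row in rows:
    tests = row["input_output"]
    if isinstance(tests, str):  # test cases may be stored as a JSON string
        tests = json.loads(tests)
    print(row.get("difficulty"), "-", len(tests["inputs"]), "test input(s)")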

Step 1: Import Required Dependencies

First, we import the necessary modules from the EP framework:
import json
from typing import Any, Dict, List

from eval_protocol.models import EvaluateResult, EvaluationRow, Message
from eval_protocol.pytest import SingleTurnRolloutProcessor, evaluation_test
from eval_protocol.rewards.apps_coding_reward import evaluate_apps_solution
  • json: For parsing the complex input/output test case data
  • typing: Python’s typing module for type hints
  • EvaluateResult, EvaluationRow, Message: Core EP data structures
  • SingleTurnRolloutProcessor: Rollout processor for single-turn conversations (one prompt, one completion)
  • evaluation_test: Decorator for configuring evaluation tests
  • evaluate_apps_solution: Specialized function for evaluating APPS competitive programming solutions

Step 2: Create the Dataset Adapter

We need to convert the APPS dataset format to the EP’s expected format:
def apps_dataset_to_evaluation_row(data: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """
    Convert entries from APPS dataset to EvaluationRow objects.
    
    This adapter extracts the problem statement and stores the comprehensive
    test cases (input/output pairs) as ground truth for evaluation.
    
    Args:
        data: List of APPS dataset entries with problem descriptions and test cases
        
    Returns:
        List of EvaluationRow objects ready for evaluation
    """
    return [
        EvaluationRow(
            messages=[Message(role="user", content=row["question"])],
            ground_truth=row["input_output"]
        )
        for row in data
    ]
This adapter:
  • Uses the complete problem description as the user message
  • Stores the JSON test case data as ground truth for comprehensive evaluation
  • Preserves the complex input/output format required for competitive programming
  • Creates proper Message objects for the evaluation framework
Key transformations:
  • Problem preservation: Maintains full problem statements with constraints and examples
  • Test case handling: Preserves multiple test cases with complex input/output formats
  • Ground truth format: Keeps JSON structure for sophisticated evaluation logic
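Building on the adapter above, a quick hypothetical usage sketch shows the shape of the resulting rows; in the real test the EP framework loads the JSONL file and applies the adapter for you via the dataset_adapter argument:
sample = [
    {
        "question": "Read two integers from standard input and print their sum.",
        "input_output": '{"inputs": ["1 2\\n"], "outputs": ["3\\n"]}',
    }
]

rows = apps_dataset_to_evaluation_row(sample)
print(rows[0].messages[0].content)  # the problem statement shown to the model
print(rows[0].ground_truth)         # the JSON test cases used for scoring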

Step 3: Configure and Run the Evaluation

We use the @evaluation_test decorator to configure the APPS evaluation:
@evaluation_test(
    input_dataset=["tests/pytest/data/apps_sample_dataset.jsonl"],
    dataset_adapter=apps_dataset_to_evaluation_row,
    completion_params=[{"model": "accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0, "max_tokens": 4096}],
    passed_threshold=0.33,
    rollout_processor=SingleTurnRolloutProcessor(),
    num_runs=1,
    mode="pointwise",
)
def test_apps_code_evaluation(row: EvaluationRow) -> EvaluationRow:
    """
    Evaluation function that tests APPS coding problems using evaluate_apps_solution.

    Args:
        row: EvaluationRow containing the conversation messages and ground_truth as JSON string
    
    Returns:
        EvaluationRow with the evaluation result
    """
    # Use evaluate_apps_solution directly
    result = evaluate_apps_solution(
        messages=row.messages,
        ground_truth=row.ground_truth,
    )
    
    # Set the evaluation result on the row
    row.evaluation_result = result
    
    return row
Configuration parameters:
  • input_dataset: Path to the APPS sample dataset JSONL file
  • dataset_adapter: Function that converts APPS entries to EvaluationRow objects
  • completion_params: Model and sampling parameters; a capable model with temperature 0.0 and a higher max_tokens limit for complex solutions
  • passed_threshold: 0.33 success-rate threshold (competitive programming is challenging)
  • rollout_processor: SingleTurnRolloutProcessor for single-turn conversations
  • num_runs: Number of evaluation runs per row (1 here)
  • mode: "pointwise" for evaluating individual problems independently
Evaluation process:
  1. Problem presentation: Present the full competitive programming problem to the model
  2. Solution generation: Model generates a complete algorithmic solution
  3. Code extraction: Extract Python code from the model’s response
  4. Comprehensive testing: Run solution against all test cases in the problem
  5. Pass rate calculation: Calculate percentage of test cases passed
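With the decorator in place, the evaluation runs like any other pytest test. A minimal way to invoke just this test programmatically, assuming the test file lives at tests/pytest/test_apps_coding.py alongside the sample data:
import pytest

# Run only the APPS evaluation test, with verbose output
pytest.main(["tests/pytest/test_apps_coding.py", "-k", "test_apps_code_evaluation", "-v"])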

Core Functions Explained

evaluate_apps_solution Function

The evaluate_apps_solution function is a specialized evaluation function designed for competitive programming problems that handles complex test case execution and scoring.
Key Features:
  • Code extraction: Identifies and extracts Python code from model responses
  • Test case parsing: Processes JSON test case data with multiple input/output pairs
  • Secure execution: Runs code safely with timeouts and resource limitations
  • Comprehensive scoring: Calculates pass rates across all test cases
  • Error handling: Provides detailed feedback on compilation and runtime errors
  • Competitive format support: Handles standard input/output format used in contests
Function Signature:
def evaluate_apps_solution(
    messages: List[Message], 
    ground_truth: Optional[str], 
    **kwargs
) -> EvaluateResult:
Parameters:
  • messages: List of conversation messages (problem statement from user, solution from assistant)
  • ground_truth: JSON string containing test cases with inputs and expected outputs
  • **kwargs: Additional parameters including execution timeout settings
Return Value:
  • EvaluateResult with pass rate score (0.0 to 1.0) and detailed metrics

Implementation Details

The evaluate_apps_solution function implements a comprehensive evaluation pipeline with robust security and error handling.
1. Code Extraction Process:
# Extract Python code from model response
code_solution = _extract_python_code(raw_solution_content)

# Handles various response formats:
# - Markdown code blocks: ```python ... ```
# - Inline code snippets  
# - Mixed text and code responses
# - Removes verbose explanations and comments
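For intuition, here is a hypothetical extraction helper; the actual _extract_python_code in eval_protocol may differ, but the idea is the same: prefer the last fenced Python block and fall back to the raw reply.
import re

def extract_python_code(response: str) -> str:
    """Return the last fenced code block, or the raw text if no fences are found."""
    blocks = re.findall(r"```(?:python)?\s*\n(.*?)```", response, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else response.strip()

reply = "Here is my solution:\n```python\nn = int(input())\nprint(n * 2)\n```"
print(extract_python_code(reply))  # -> "n = int(input())\nprint(n * 2)"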
2. Ground Truth Processing:
# Parse JSON test case data
if isinstance(ground_truth, str):
    in_outs = json.loads(ground_truth)  # Parse JSON string
elif isinstance(ground_truth, dict):
    in_outs = ground_truth  # Already parsed by JSONL loader

# Validate required structure
assert "inputs" in in_outs and "outputs" in in_outs
3. Secure Test Execution:
The evaluation uses sandboxed execution with comprehensive security measures:
# Illustrative sketch of the execution loop (simplified from the reward implementation)
import subprocess
import sys

# Force standard input execution path and prepare secure environment
in_outs_for_check = in_outs.copy()
if "fn_name" in in_outs_for_check:
    del in_outs_for_check["fn_name"]  # Use stdin/stdout testing

# For each test case in the problem:
inputs, outputs = in_outs_for_check["inputs"], in_outs_for_check["outputs"]
results = []
for test_input, expected_output in zip(inputs, outputs):
    # Prepare secure execution environment
    wrapped_code = f"""
import sys
sys.setrecursionlimit(6*10**5)
{standard_imports}  # Common competitive programming imports

{user_generated_code}  # placeholder: the Python code extracted from the model's response
"""
    
    # Execute in isolated subprocess with resource limits
    process = subprocess.run(
        [sys.executable, "-c", wrapped_code],
        input=test_input,
        capture_output=True,
        timeout=timeout,  # per-test-case time limit (seconds), passed via **kwargs
        text=True
    )
    
    # Compare outputs and record result
    if process.returncode == 0:
        actual_output = process.stdout.strip()
        results.append(actual_output == expected_output.strip())
    else:
        results.append(False)  # Runtime error
Security Features:
  • Sandboxed execution: Code runs in isolated subprocess with resource limits
  • Standard I/O redirection: Test inputs via stdin, outputs captured from stdout
  • Security restrictions: File system access, network operations, and dangerous imports disabled
  • Resource monitoring: Memory usage, CPU time, and execution duration tracked
  • Timeout enforcement: Long-running or infinite loops automatically terminated
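To make the mechanism concrete, here is a self-contained toy version of the stdin/stdout check for a single test case (an assumed simplification, not the actual EP harness):
import subprocess
import sys

candidate_code = "n = int(input())\nprint(n * 2)\n"  # stands in for the extracted solution
test_input, expected_output = "21\n", "42\n"

proc = subprocess.run(
    [sys.executable, "-c", candidate_code],
    input=test_input,
    capture_output=True,
    text=True,
    timeout=5,
)
passed = proc.returncode == 0 and proc.stdout.strip() == expected_output.strip()
print("passed" if passed else "failed")  # -> passed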
4. Scoring and Error Analysis:
# Calculate pass rate from results
actual_results = results_list  # List of True/False for each test case
num_tests = len(actual_results)
passed_count = sum(1 for res in actual_results if res is True)
score = float(passed_count) / num_tests

# Process execution metadata for detailed error reporting
if exec_metadata_list:
    if len(exec_metadata_list) == 1 and exec_metadata_list[0].get("error"):
        # Global compilation error
        reason_msg += f" Execution Error: {exec_metadata_list[0]['error']}"
    elif score == 0.0 and exec_metadata_list[0].get("error_message") == "Wrong Answer":
        # Detailed failure analysis with specific test case details
        first_fail_meta = exec_metadata_list[0]
        reason_msg += (
            f". First fail details: Inputs: {first_fail_meta.get('inputs', 'N/A')}, "
            f"Expected: {first_fail_meta.get('expected', 'N/A')}, "
            f"Got: {first_fail_meta.get('output', 'N/A')}"
        )
Error Handling Hierarchy:
  1. Code extraction failure: Score 0.0 - No valid Python code found
  2. Compilation errors: Score 0.0 - Syntax errors prevent execution
  3. Runtime errors: Per-test-case failure - Exceptions during execution
  4. Timeout errors: Per-test-case failure - Exceeded time limits
  5. Wrong output: Per-test-case failure - Incorrect results but valid execution
  6. Perfect execution: Score 1.0 - All test cases pass with correct outputs
Result Types:
  • True: Test case passed with correct output
  • False: Test case failed (wrong output)
  • -1: Runtime error or timeout
  • -2: Compilation error
Example Evaluation Flow:
# Problem: Mikhail's diagonal moves (from example above)
# Model generates a solution; model_solution below stands in for the assistant's reply
result = evaluate_apps_solution(
    messages=[
        Message(role="user", content=problem_description),
        Message(role="assistant", content=model_solution),  # completion containing the code
    ],
    ground_truth='{"inputs": ["3\\n2 2 3\\n4 3 7\\n10 1 9\\n"], "outputs": ["1\\n6\\n-1\\n"]}',
)

# Result might be:
# EvaluateResult(
#     score=1.0,  # All test cases passed
#     reason="Passed 1/1 test cases",
#     metrics={
#         "pass_rate": MetricResult(score=1.0, reason="1/1"),
#         "execution_metadata": MetricResult(...)
#     }
# )

Evaluation Scenarios and Results

The APPS coding evaluation handles various competitive programming scenarios:

Perfect Solution (Score: 1.0)

Scenario: Model correctly solves all test cases
# Problem: Mikhail's diagonal moves
# Model provides optimal solution using coordinate geometry
q = int(input())
for _ in range(q):
    x, y, k = list(map(int, input().split()))
    x, y = abs(x), abs(y)
    # ... correct algorithm implementation
    # Handles all coordinate movement constraints

# Result: ✅ Passed 3/3 test cases (100% success rate)

Partial Solution (Score: 0.67)

Scenario: Model solves most test cases but fails on edge cases
# Problem: Mikhail's diagonal moves  
# Model has correct main logic but misses boundary condition
q = int(input())
for _ in range(q):
    x, y, k = list(map(int, input().split()))
    # ... mostly correct implementation
    # Fails on impossible movement case

# Result: ⚠️ Passed 2/3 test cases (67% success rate)

Algorithmic Error (Score: 0.0)

Scenario: Model uses incorrect algorithm approach
# Problem: Mikhail's diagonal moves
# Model uses incorrect movement calculation
q = int(input())
for _ in range(q):
    x, y, k = list(map(int, input().split()))
    # Incorrect approach - doesn't consider diagonal optimization
    print(k)  # Always outputs k regardless of constraints

# Result: ❌ Passed 0/3 test cases - Wrong algorithmic approach

Timeout Error (Score: 0.0)

Scenario: Model solution exceeds time limits
# Problem: Mikhail's diagonal moves
# Model uses inefficient brute force instead of mathematical approach
q = int(input())
for _ in range(q):
    x, y, k = list(map(int, input().split()))
    # Simulates all possible paths - exponential complexity
    # Times out on larger coordinate values

# Result: ❌ Execution timeout - Algorithm too slow for constraints

Compilation Error (Score: 0.0)

Scenario: Model generates syntactically incorrect code
# Problem: Mikhail's diagonal moves
# Model has syntax errors
q = int(input())
for _ in range(q)  # Missing colon
    x, y, k = list(map(int, input().split()))
    # ... rest of solution

# Result: ❌ Compilation error: SyntaxError - Invalid Python syntax

Conclusion

This APPS coding evaluation demonstrates how to assess AI models’ competitive programming capabilities using comprehensive algorithmic challenges. The evaluation ensures models can understand complex problem statements, design efficient algorithms, and implement solutions that pass rigorous test suites. This evaluation is particularly valuable for:
  • Algorithmic reasoning assessment: Testing advanced problem-solving capabilities
  • Competitive programming preparation: Validating solutions against contest-quality problems
  • Algorithm implementation: Ensuring correct and efficient code generation
The APPS evaluation focuses on algorithmic correctness and efficiency rather than simple function implementation, making it essential for building AI systems capable of sophisticated problem-solving. It provides comprehensive testing with real competitive programming challenges and detailed performance metrics.