Evaluate competitive programming abilities using the APPS dataset with comprehensive test suites
problem_id: Unique identifier for the problem
question: Detailed problem description with constraints, examples, and input/output format
solutions: Array of reference Python solutions that correctly solve the problem
input_output: JSON containing comprehensive test cases with inputs and expected outputs
difficulty: Classification as “introductory”, “interview”, or “competition”
url: Source URL of the original problem from competitive programming platforms
starter_code: Optional template code to begin implementation

json: For parsing the complex input/output test case data
typing: Python’s typing module for type hints
EvaluateResult, EvaluationRow, Message: Core EP data structures
default_single_turn_rollout_processor: Default processor for single-turn conversations
evaluation_test: Decorator for configuring evaluation tests
evaluate_apps_solution: Specialized function for evaluating APPS competitive programming solutions

Use the @evaluation_test decorator to configure the APPS evaluation:
input_dataset: Path to the APPS dataset JSONL file
model: The model to evaluate (uses a capable model for complex problems)
rollout_input_params: Model parameters with a higher token limit for complex solutions
threshold_of_success: 33% success rate threshold (competitive programming is challenging)
mode: pointwise, for evaluating individual problems independently
dataset_adapter: Function that converts APPS format to EvaluationRow objects
rollout_processor: Uses default single-turn processor

evaluate_apps_solution Function

The evaluate_apps_solution function is a specialized evaluation function designed for competitive programming problems that handles complex test case execution and scoring.
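To make "test case execution and scoring" concrete, here is a minimal sketch of how a pass rate could be computed from an input_output value. It is not the actual evaluate_apps_solution implementation: the helper name pass_rate, its timeout_s parameter, and the assumption that input_output holds parallel "inputs" and "outputs" lists of stdin/stdout strings are all illustrative. The sketch runs each case in a plain subprocess with a timeout and does no sandboxing, so it should only be tried on trusted code.

```python
import json
import subprocess
import sys


def pass_rate(solution_code: str, input_output_json: str, timeout_s: float = 5.0) -> float:
    """Hypothetical helper: fraction of stdin/stdout test cases the solution passes."""
    cases = json.loads(input_output_json)
    # Assumed layout: parallel lists of stdin strings and expected stdout strings.
    inputs, outputs = cases["inputs"], cases["outputs"]
    if not inputs:
        return 0.0

    passed = 0
    for stdin_text, expected in zip(inputs, outputs):
        try:
            # Run the candidate solution in a fresh interpreter, feeding the test input on stdin.
            proc = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            continue  # a timed-out case counts as a failure
        # Compare output ignoring leading/trailing whitespace only.
        if proc.returncode == 0 and proc.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(inputs)


# Example: a two-case "sum of two integers" problem.
tests = json.dumps({"inputs": ["1 2\n", "3 4\n"], "outputs": ["3\n", "7\n"]})
solution = "a, b = map(int, input().split())\nprint(a + b)"
score = pass_rate(solution, tests)
print(score)
```

The score is a plain fraction of passed cases, which matches the 0.0 to 1.0 range a pass rate would need; a crash, a wrong answer, and a timeout are all treated the same way here, as a failed case.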
Key Features:
messages: List of conversation messages (problem statement from user, solution from assistant)
ground_truth: JSON string containing test cases with inputs and expected outputs
**kwargs: Additional parameters including execution timeout settings

Returns an EvaluateResult with a pass rate score (0.0 to 1.0) and detailed metrics.

The evaluate_apps_solution function implements a comprehensive evaluation pipeline with robust security and error handling:
1. Code Extraction Process: