This example wires up a lightweight AIME-style evaluation using the open AIME2025 JSONL from Hugging Face. It is intended for quick model selection rather than a faithful reimplementation of the benchmark. It is now implemented as a suite in eval_protocol/benchmarks/suites/aime25.py and exported as aime25.

What it does

  • Pulls AIME2025 JSONL directly from Hugging Face.
  • Prompts the model to reason and place the final answer inside \boxed{...}.
  • Parses the boxed value and compares it against the ground truth for exact-match scoring (see the sketch after this list).
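
The parsing-and-scoring step is small enough to sketch. The helpers below are illustrative rather than the suite's actual code, but a minimal version of the boxed-value extraction and exact-match comparison looks roughly like this:

```python
import re

# Illustrative helpers only; the suite's real implementation may differ.
BOXED_RE = re.compile(r"\\boxed\{([^}]*)\}")

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the completion, if any.

    Taking the last match tolerates models that box intermediate results.
    """
    matches = BOXED_RE.findall(text)
    return matches[-1].strip() if matches else None

def exact_match_score(completion: str, ground_truth: str) -> float:
    """1.0 if the boxed value equals the ground truth as an integer, else 0.0.

    AIME answers are integers in [0, 999], so comparing as ints avoids
    penalizing formatting differences such as leading zeros.
    """
    boxed = extract_boxed(completion)
    if boxed is None:
        return 0.0
    try:
        return 1.0 if int(boxed) == int(ground_truth) else 0.0
    except ValueError:
        return 0.0
```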

How it’s configured

Key pieces in the SDK example:
  • Dataset adapter converts raw rows with question and answer into EvaluationRows.
  • @evaluation_test provides URLs, model, and rollout parameters (including optional reasoning-effort variants).
  • Evaluator extracts the final integer from the assistant message and checks equality with the ground truth (a shape sketch follows this list).
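
As a rough shape sketch of how the adapter and evaluator fit together: the class and field names below follow the eval-protocol docs (EvaluationRow, Message, EvaluateResult) but may differ by version, so verify against the installed package, and exact_match_score is the helper from the parsing sketch above.

```python
# Shape sketch only: field names are assumptions based on the eval-protocol
# docs and may not match the installed version exactly.
from eval_protocol.models import EvaluateResult, EvaluationRow, Message

def aime_dataset_adapter(rows: list[dict]) -> list[EvaluationRow]:
    """Convert raw {"question": ..., "answer": ...} rows into EvaluationRows."""
    return [
        EvaluationRow(
            messages=[Message(role="user", content=row["question"])],
            ground_truth=str(row["answer"]),
        )
        for row in rows
    ]

def evaluate_aime(row: EvaluationRow) -> EvaluationRow:
    """Score the last assistant message by strict exact match on the boxed integer."""
    completion = row.messages[-1].content or ""
    score = exact_match_score(completion, row.ground_truth)
    row.evaluation_result = EvaluateResult(score=score)
    return row
```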

Run it locally

After installing eval-protocol, you can run the benchmark from anywhere:
```bash
pytest --pyargs eval_protocol.benchmarks.test_aime25 -v \
  --ep-print-summary --ep-summary-json artifacts/aime25.json
```
Tip: use --ep-max-rows=50 to limit the dataset size, or --ep-max-rows=all to run the full dataset. You can also pass --ep-reasoning-effort=high and --ep-input-param temperature=0.0 to adjust model settings, as in the combined example below.
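
For example, a 50-row run with high reasoning effort and deterministic sampling combines those flags as:

```bash
pytest --pyargs eval_protocol.benchmarks.test_aime25 -v \
  --ep-max-rows=50 --ep-reasoning-effort=high \
  --ep-input-param temperature=0.0 \
  --ep-print-summary --ep-summary-json artifacts/aime25.json
```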

Notes

  • This is a convenience wrapper for model selection, not a canonical reproduction of AIME.
  • The evaluation is strict exact match over an integer parsed from \boxed{...}.