This example wires up a lightweight AIME-style evaluation using the open AIME2025 JSONL from Hugging Face. It is intended for quick model selection rather than a faithful reimplementation of the benchmark. It is now implemented as a suite in eval_protocol/benchmarks/suites/aime25.py and exported as aime25.

What it does

  • Pulls AIME2025 JSONL directly from Hugging Face.
  • Prompts the model to reason and place the final answer inside \boxed{...}.
  • Parses the boxed value and compares it against the ground truth for exact-match scoring (see the sketch after this list).
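
The parsing-and-scoring step is small enough to sketch. The helpers below are illustrative rather than the suite's actual code, but a minimal version of the boxed-value extraction and exact-match comparison looks roughly like this:

```python
import re

# Illustrative helpers only; the suite's real implementation may differ.
BOXED_RE = re.compile(r"\\boxed\{([^}]*)\}")

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the completion, if any.

    Taking the last match tolerates models that box intermediate results.
    """
    matches = BOXED_RE.findall(text)
    return matches[-1].strip() if matches else None

def exact_match_score(completion: str, ground_truth: str) -> float:
    """1.0 if the boxed value equals the ground truth as an integer, else 0.0.

    AIME answers are integers in [0, 999], so comparing as ints avoids
    penalizing formatting differences such as leading zeros.
    """
    boxed = extract_boxed(completion)
    if boxed is None:
        return 0.0
    try:
        return 1.0 if int(boxed) == int(ground_truth) else 0.0
    except ValueError:
        return 0.0
```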

How it’s configured

Key pieces in the SDK example:
  • Dataset adapter converts raw rows with question and answer into EvaluationRows.
  • @evaluation_test provides URLs, model, and rollout parameters (including optional reasoning-effort variants).
  • Evaluator extracts the final integer from the assistant message and checks equality with the ground truth (a shape sketch follows this list).
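
As a rough shape sketch of how the adapter and evaluator fit together: the class and field names below follow the eval-protocol docs (EvaluationRow, Message, EvaluateResult) but may differ by version, so verify against the installed package, and exact_match_score is the helper from the parsing sketch above.

```python
# Shape sketch only: field names are assumptions based on the eval-protocol
# docs and may not match the installed version exactly.
from eval_protocol.models import EvaluateResult, EvaluationRow, Message

def aime_dataset_adapter(rows: list[dict]) -> list[EvaluationRow]:
    """Convert raw {"question": ..., "answer": ...} rows into EvaluationRows."""
    return [
        EvaluationRow(
            messages=[Message(role="user", content=row["question"])],
            ground_truth=str(row["answer"]),
        )
        for row in rows
    ]

def evaluate_aime(row: EvaluationRow) -> EvaluationRow:
    """Score the last assistant message by strict exact match on the boxed integer."""
    completion = row.messages[-1].content or ""
    score = exact_match_score(completion, row.ground_truth)
    row.evaluation_result = EvaluateResult(score=score)
    return row
```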

Run it locally

After installing eval-protocol, you can run the benchmark from anywhere:
```bash
pytest --pyargs eval_protocol.benchmarks.test_aime25 -v \
  --ep-print-summary --ep-summary-json artifacts/aime25.json
```
Tip: use --ep-max-rows=50 to limit the dataset size, or --ep-max-rows=all to run the full dataset. You can also pass --ep-reasoning-effort=high and --ep-input-param temperature=0.0 to adjust model settings, as in the combined example below.
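
For example, a 50-row run with high reasoning effort and deterministic sampling combines those flags as:

```bash
pytest --pyargs eval_protocol.benchmarks.test_aime25 -v \
  --ep-max-rows=50 --ep-reasoning-effort=high \
  --ep-input-param temperature=0.0 \
  --ep-print-summary --ep-summary-json artifacts/aime25.json
```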

Notes

  • This is a convenience wrapper for model selection, not a canonical reproduction of AIME.
  • The evaluation is strict exact match over an integer parsed from \boxed{...}.