Quick AIME-style math check using boxed final answers
This example uses the AIME2025 JSONL from Hugging Face. It is intended for quick model picking rather than a full reimplementation of the benchmark.
The suite is implemented in eval_protocol/benchmarks/suites/aime25.py and exported as aime25. The model is expected to give its final answer inside \\boxed{...}.
The question and answer fields of each dataset row are converted into EvaluationRow objects. The @evaluation_test decorator provides URLs, model, and rollout parameters (including optional reasoning-effort variants).
Use --ep-max-rows=50 to limit dataset size, or --ep-max-rows=all for the full dataset. You can also use --ep-reasoning-effort=high and --ep-input-param temperature=0.0 to adjust model settings.
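As a rough illustration of the row-conversion step, the sketch below turns JSONL text with question and answer keys into simple row objects. The Row dataclass, its field names, and rows_from_jsonl are hypothetical stand-ins chosen for this example; they are not the real EvaluationRow class or the suite's own adapter. The max_rows argument is likewise an assumption that loosely mirrors the effect of limiting dataset size.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    """Hypothetical stand-in for EvaluationRow (the real class differs)."""
    prompt: str
    ground_truth: str


def rows_from_jsonl(text: str, max_rows: Optional[int] = None) -> List[Row]:
    """Convert JSONL text with `question` and `answer` keys into Row objects."""
    rows: List[Row] = []
    for line in text.splitlines():
        # Stop once the requested number of rows has been collected.
        if max_rows is not None and len(rows) >= max_rows:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Answers may be stored as numbers or strings; normalize to str.
        rows.append(Row(prompt=record["question"],
                        ground_truth=str(record["answer"])))
    return rows


# Made-up sample data, not taken from the real dataset.
sample = (
    '{"question": "What is 2 + 3?", "answer": "5"}\n'
    '{"question": "What is 7 * 6?", "answer": 42}\n'
)
print(rows_from_jsonl(sample, max_rows=1))
```

Normalizing the answer to a string keeps later comparisons uniform regardless of how the source file encodes it.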
Scoring relies on the final answer appearing inside \\boxed{...}.
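A boxed-answer check of this kind can be sketched as follows. This is a minimal illustration, assuming the last boxed expression in the response is the final answer and that answers are compared as integers (AIME answers are integers from 0 to 999). The function names extract_boxed and is_correct are made up for this example and are not the suite's actual parser.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in text, or None."""
    marker = "\\boxed{"  # a single backslash, as it appears in LaTeX output
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking brace depth so nested braces are kept intact.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def is_correct(response: str, answer: str) -> bool:
    """Compare the boxed value with the reference answer."""
    predicted = extract_boxed(response)
    if predicted is None:
        return False
    try:
        # Integer comparison treats "042" and "42" as the same answer.
        return int(predicted) == int(answer)
    except ValueError:
        # Fall back to exact string match for non-integer content.
        return predicted == answer.strip()


print(is_correct("Adding up the cases gives \\boxed{042}.", "42"))
```

Taking the last boxed expression rather than the first is a design choice made here so that intermediate boxed values in a long reasoning trace do not shadow the final one.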