Multiple-choice science QA with simple exact-match scoring
eval_protocol/benchmarks/suites/gpqa.py
and exported as gpqa
.__GT__:A
) per row.@evaluation_test
feeds prebuilt input_messages
and sets rollout parameters.--ep-max-rows=20
to tune runtime. The CSV is fetched at runtime.
A, B, C, D
from the model output.