This example runs a minimal GPQA-style evaluation using the public Diamond split CSV. It is meant for quick comparisons during model selection, not for a full benchmark reproduction.
This example is implemented as a suite in eval_protocol/benchmarks/suites/gpqa.py and exported as gpqa.

What it does

  • Downloads the GPQA Diamond CSV and constructs multiple-choice (A–D) prompts.
  • Appends a system-side ground-truth token (e.g., __GT__:A) to each row's messages.
  • Extracts the predicted letter from the assistant's final message and checks it against the ground truth for an exact match (see the sketch after this list).
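For orientation, here is a minimal sketch of the row-to-prompt step. The CSV column names, the fixed option ordering, and the helper name are assumptions for illustration; the actual suite in gpqa.py may shuffle options and use a different schema.

```python
# Assumed GPQA Diamond column names -- the real suite's schema may differ.
QUESTION_COL = "Question"
CORRECT_COL = "Correct Answer"
WRONG_COLS = ["Incorrect Answer 1", "Incorrect Answer 2", "Incorrect Answer 3"]

def build_row_messages(row: dict) -> list[dict]:
    """Turn one CSV row into chat messages: an A-D prompt plus a GT token."""
    # Option order is fixed here for simplicity; a real run would shuffle it.
    options = [row[CORRECT_COL]] + [row[c] for c in WRONG_COLS]
    body = "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
    prompt = f"{row[QUESTION_COL]}\n\n{body}\n\nAnswer with a single letter (A-D)."
    return [
        # The ground-truth token rides along on the system side; with the
        # fixed ordering above, the correct option is always A.
        {"role": "system", "content": "__GT__:A"},
        {"role": "user", "content": prompt},
    ]
```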

How it’s configured

  • @evaluation_test feeds prebuilt input_messages and sets rollout parameters.
  • Simple scoring: 1.0 for an exact letter match, else 0.0 (a sketch follows this list).
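For concreteness, here is a sketch of that scorer given the message shape described above. The __GT__: prefix comes from the suite's description; the function names are hypothetical.

```python
GT_PREFIX = "__GT__:"

def extract_gt(messages: list[dict]) -> str | None:
    """Recover the ground-truth letter from the system-side token."""
    for m in messages:
        content = m.get("content") or ""
        if m.get("role") == "system" and content.startswith(GT_PREFIX):
            return content[len(GT_PREFIX):].strip()
    return None

def score(predicted: str | None, gt: str | None) -> float:
    """Binary scoring: 1.0 for an exact letter match, else 0.0."""
    return 1.0 if predicted is not None and predicted == gt else 0.0
```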

Run it locally

After installing eval-protocol, you can run the benchmark from anywhere:
```bash
pytest --pyargs eval_protocol.benchmarks.test_gpqa -v \
  --ep-print-summary --ep-summary-json artifacts/gpqa.json
```
Use --ep-max-rows=20 to cap the number of rows and keep runtime short. The CSV is fetched at runtime, so network access is required.
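To inspect the resulting artifact, a schema-agnostic dump is safest, since this page does not document the summary's structure:

```python
import json
from pathlib import Path

# Pretty-print whatever --ep-summary-json wrote; no schema is assumed.
summary = json.loads(Path("artifacts/gpqa.json").read_text())
print(json.dumps(summary, indent=2))
```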

Notes

  • Convenience-oriented: focuses on a clean pipeline and minimal metrics.
  • The evaluation relies on extracting exactly one of A, B, C, or D from the model output; ambiguous responses cannot earn credit (see the regex sketch below).
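A minimal sketch of that extraction step, assuming a standalone-letter convention; the suite's actual pattern may be stricter or more lenient:

```python
import re

def extract_letter(text: str) -> str | None:
    """Return the predicted letter only if exactly one distinct A-D appears.

    Zero candidates or conflicting letters yield None, which scores 0.0.
    """
    matches = re.findall(r"\b([ABCD])\b", text)
    return matches[0] if matches and len(set(matches)) == 1 else None
```

For example, "The answer is (B)." yields "B", while "A or B" is ambiguous and yields None.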