This example runs a minimal GPQA-style evaluation using the public Diamond split CSV. It is meant for quick comparisons during model selection, not for a full benchmark reproduction.
This example is implemented as a suite in eval_protocol/benchmarks/suites/gpqa.py and exported as gpqa.

What it does

  • Downloads the GPQA Diamond CSV and constructs multiple-choice (A–D) prompts.
  • Appends a system-side ground-truth token (e.g., __GT__:A) to each row's messages.
  • Extracts the predicted letter from the assistant's final message and checks it against the ground truth for an exact match (see the sketch after this list).
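For orientation, here is a minimal sketch of the row-to-prompt step. The CSV column names, the fixed option ordering, and the helper name are assumptions for illustration; the actual suite in gpqa.py may shuffle options and use a different schema.

```python
# Assumed GPQA Diamond column names -- the real suite's schema may differ.
QUESTION_COL = "Question"
CORRECT_COL = "Correct Answer"
WRONG_COLS = ["Incorrect Answer 1", "Incorrect Answer 2", "Incorrect Answer 3"]

def build_row_messages(row: dict) -> list[dict]:
    """Turn one CSV row into chat messages: an A-D prompt plus a GT token."""
    # Option order is fixed here for simplicity; a real run would shuffle it.
    options = [row[CORRECT_COL]] + [row[c] for c in WRONG_COLS]
    body = "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
    prompt = f"{row[QUESTION_COL]}\n\n{body}\n\nAnswer with a single letter (A-D)."
    return [
        # The ground-truth token rides along on the system side; with the
        # fixed ordering above, the correct option is always A.
        {"role": "system", "content": "__GT__:A"},
        {"role": "user", "content": prompt},
    ]
```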

How it’s configured

  • @evaluation_test feeds prebuilt input_messages and sets rollout parameters.
  • Simple scoring: 1.0 for an exact letter match, else 0.0 (a sketch follows this list).
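For concreteness, here is a sketch of that scorer given the message shape described above. The __GT__: prefix comes from the suite's description; the function names are hypothetical.

```python
GT_PREFIX = "__GT__:"

def extract_gt(messages: list[dict]) -> str | None:
    """Recover the ground-truth letter from the system-side token."""
    for m in messages:
        content = m.get("content") or ""
        if m.get("role") == "system" and content.startswith(GT_PREFIX):
            return content[len(GT_PREFIX):].strip()
    return None

def score(predicted: str | None, gt: str | None) -> float:
    """Binary scoring: 1.0 for an exact letter match, else 0.0."""
    return 1.0 if predicted is not None and predicted == gt else 0.0
```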

Run it locally

After installing eval-protocol, you can run the benchmark from anywhere:
```bash
pytest --pyargs eval_protocol.benchmarks.test_gpqa -v \
  --ep-print-summary --ep-summary-json artifacts/gpqa.json
```
Use --ep-max-rows=20 to cap the number of rows and keep runtime short. The CSV is fetched at runtime, so network access is required.
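To inspect the resulting artifact, a schema-agnostic dump is safest, since this page does not document the summary's structure:

```python
import json
from pathlib import Path

# Pretty-print whatever --ep-summary-json wrote; no schema is assumed.
summary = json.loads(Path("artifacts/gpqa.json").read_text())
print(json.dumps(summary, indent=2))
```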

Notes

  • Convenience-oriented: focuses on a clean pipeline and minimal metrics.
  • The evaluation relies on extracting exactly one of A, B, C, or D from the model output; ambiguous responses cannot earn credit (see the regex sketch below).
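A minimal sketch of that extraction step, assuming a standalone-letter convention; the suite's actual pattern may be stricter or more lenient:

```python
import re

def extract_letter(text: str) -> str | None:
    """Return the predicted letter only if exactly one distinct A-D appears.

    Zero candidates or conflicting letters yield None, which scores 0.0.
    """
    matches = re.findall(r"\b([ABCD])\b", text)
    return matches[0] if matches and len(set(matches)) == 1 else None
```

For example, "The answer is (B)." yields "B", while "A or B" is ambiguous and yields None.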