HealthBench (Open-Resource)

This example provides a minimal, rubric-driven proxy inspired by HealthBench—for quick sanity checks in clinical-style prompts. It is not a comprehensive or official reimplementation.

This example is now implemented as a suite in eval_protocol/benchmarks/suites/healthbench.py and exported as healthbench.

What it does

Uses a few in-memory prompts with small rubric lists.
Extracts simple keyword requirements from rubric criteria (e.g., “hospital”, “urgent”, “hydration”, “rest”).
Scores 1.0 if the assistant’s response contains any required rubric keywords; otherwise 0.0.

How it’s configured

@evaluation_test sets a small temperature and token budget.
Messages are constructed inline; rubrics are mapped by prompt string.

Run it locally

After installing eval-protocol, you can run the benchmark from anywhere:

pytest --pyargs eval_protocol.benchmarks.test_healthbench -v \
  --ep-print-summary --ep-summary-json artifacts/healthbench.json

Notes

This is a minimal proxy to surface safety/quality cues—not a validated clinical benchmark.
You can expand the rubric list or keyword extraction as needed for your domain.

Tutorials

Examples

Integrations

Concepts

Reference

Open-Resource Benchmarks

HealthBench (Open-Resource)

What it does

How it’s configured

Run it locally

Notes

Tutorials

Examples

Integrations

Concepts

Reference

Open-Resource Benchmarks

​What it does

​How it’s configured

​Run it locally

​Notes

What it does

How it’s configured

Run it locally

Notes