This example showcases three LiveBench Data Analysis tasks wired into Eval Protocol with minimal scoring ports adapted from the original benchmark: CTA, Table Join, and Table Reformat.
Suites live in the Python SDK under eval_protocol/benchmarks/suites/livebench_data_analysis.py and are exported as runnable benchmarks.

What it includes

  • CTA: case-insensitive exact/suffix match over cleaned strings
  • Table Join: F1 over key-value mappings recovered from model output
  • Table Reformat: strict table equivalence with parser fallbacks; the scoring version is auto-selected based on the question's release date
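To make the first two checks concrete, here is a minimal sketch of the style of scoring (illustrative only, not the actual port; helper names like clean_label and joinmap_f1 are made up for this sketch):

import re

def clean_label(s: str) -> str:
    # Lowercase and drop punctuation so formatting noise doesn't affect matching.
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def cta_match(pred: str, gold: str) -> float:
    # Case-insensitive exact match over cleaned strings, with a suffix fallback.
    p, g = clean_label(pred), clean_label(gold)
    return 1.0 if p == g or p.endswith(g) or g.endswith(p) else 0.0

def joinmap_f1(pred: dict, gold: dict) -> float:
    # F1 over the key-value mapping recovered from the model output.
    if not pred or not gold:
        return 0.0
    tp = sum(1 for k, v in pred.items() if gold.get(k) == v)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)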

Run from CLI (exported benchmark)

After installing eval-protocol, you can run the composite benchmark from anywhere:
pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis -v \
  --ep-print-summary \
  --ep-summary-json artifacts/live_bench_data_analysis.json
This composite benchmark aggregates the three tasks and reports a final combined summary.
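
Once the run finishes, the summary JSON can be inspected like any other file (a minimal sketch; the schema of the file is not documented here, so this only shows the top-level structure):

import json
from pathlib import Path

summary = json.loads(Path("artifacts/live_bench_data_analysis.json").read_text())
# Show how the combined summary is organized at the top level.
print(list(summary) if isinstance(summary, dict) else type(summary).__name__)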

Run each task individually

pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis_cta -v \
  --ep-print-summary --ep-summary-json artifacts/cta.json

pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis_tablejoin -v \
  --ep-print-summary --ep-summary-json artifacts/tablejoin.json

pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis_tablereformat -v \
  --ep-print-summary --ep-summary-json artifacts/tablereformat.json

Notes

  • Uses the Hugging Face datasets library to pull livebench/data_analysis at import time (a loading sketch follows these notes).
  • Scoring is intentionally lightweight and aims for compatibility with LiveBench behavior (e.g., tolerant parsing, suffix matches, and defensive fallbacks), not an official reproduction.
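
Roughly, the import-time pull corresponds to something like this (a sketch using the datasets library; the split name "test" is an assumption, check the dataset card for the available splits):

from datasets import load_dataset

# Download (or reuse the local cache of) the LiveBench data analysis questions.
ds = load_dataset("livebench/data_analysis", split="test")  # split name assumed
print(len(ds), "rows; columns:", ds.column_names)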