This example showcases three LiveBench Data Analysis tasks wired into Eval Protocol with minimal scoring ports adapted from the original benchmark: CTA, Table Join, and Table Reformat.
Suites live in the Python SDK under eval_protocol/benchmarks/suites/livebench_data_analysis.py and are exported as runnable benchmarks.

What it includes

  • CTA: case-insensitive exact/suffix match over cleaned strings
  • Table Join: F1 over key-value mappings recovered from model output
  • Table Reformat: strict table equivalence with parser fallbacks; the scoring version is auto-selected based on the question's release date
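To make the first two checks concrete, here is a minimal sketch of the style of scoring (illustrative only, not the actual port; helper names like clean_label and joinmap_f1 are made up for this sketch):

import re

def clean_label(s: str) -> str:
    # Lowercase and drop punctuation so formatting noise doesn't affect matching.
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def cta_match(pred: str, gold: str) -> float:
    # Case-insensitive exact match over cleaned strings, with a suffix fallback.
    p, g = clean_label(pred), clean_label(gold)
    return 1.0 if p == g or p.endswith(g) or g.endswith(p) else 0.0

def joinmap_f1(pred: dict, gold: dict) -> float:
    # F1 over the key-value mapping recovered from the model output.
    if not pred or not gold:
        return 0.0
    tp = sum(1 for k, v in pred.items() if gold.get(k) == v)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)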

Run from CLI (exported benchmark)

After installing eval-protocol, you can run the composite benchmark from anywhere:
pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis -v \
  --ep-print-summary \
  --ep-summary-json artifacts/live_bench_data_analysis.json
This composite benchmark aggregates the three tasks and reports a final combined summary.
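
Once the run finishes, the summary JSON can be inspected like any other file (a minimal sketch; the schema of the file is not documented here, so this only shows the top-level structure):

import json
from pathlib import Path

summary = json.loads(Path("artifacts/live_bench_data_analysis.json").read_text())
# Show how the combined summary is organized at the top level.
print(list(summary) if isinstance(summary, dict) else type(summary).__name__)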

Run each task individually

pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis_cta -v \
  --ep-print-summary --ep-summary-json artifacts/cta.json

pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis_tablejoin -v \
  --ep-print-summary --ep-summary-json artifacts/tablejoin.json

pytest --pyargs eval_protocol.benchmarks.test_live_bench_data_analysis_tablereformat -v \
  --ep-print-summary --ep-summary-json artifacts/tablereformat.json

Notes

  • Uses the Hugging Face datasets library to pull livebench/data_analysis at import time (a loading sketch follows these notes).
  • Scoring is intentionally lightweight and aims for compatibility with LiveBench behavior (e.g., tolerant parsing, suffix matches, and defensive fallbacks), not an official reproduction.
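
Roughly, the import-time pull corresponds to something like this (a sketch using the datasets library; the split name "test" is an assumption, check the dataset card for the available splits):

from datasets import load_dataset

# Download (or reuse the local cache of) the LiveBench data analysis questions.
ds = load_dataset("livebench/data_analysis", split="test")  # split name assumed
print(len(ds), "rows; columns:", ds.column_names)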