Eval Protocol integrates with GEPA (via DSPy) to automatically optimize your prompts using the evaluations you’ve already written. GEPA analyzes which examples pass or fail, proposes structured edits to the prompt, and keeps changes that improve your metric.
How It Works
GEPA treats your @evaluation_test as the optimization objective. It:
- Extracts the system prompt from your dataset
- Splits your data into training and validation sets
- Runs your evaluation function on candidate prompts
- Uses a reflection LLM to propose improvements based on failure patterns
- Returns the best-performing prompt
The key insight is that your evaluation’s reason field (in EvaluateResult) tells GEPA why examples failed, enabling targeted improvements.
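For example, contrast a vague reason with an actionable one. The `EvaluateResult` fields below match how they are used later in this guide; the failure details are purely illustrative:

```python
from eval_protocol.models import EvaluateResult

# Vague: gives the reflection LLM nothing to act on
EvaluateResult(score=0.0, reason="Incorrect.", is_score_valid=True)

# Specific: names the exact failure so GEPA can target the fix
EvaluateResult(
    score=0.0,
    reason="Missing column 'airport_count'; expected 12 rows, got 3.",
    is_score_valid=True,
)
```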
Prerequisites
Install eval-protocol:
```bash
pip install eval-protocol
```
Set your API key:
```bash
export FIREWORKS_API_KEY="your-fireworks-key"
```
Basic Usage
Write a normal @evaluation_test, then wrap it with GEPATrainer.
Step 1: Define Your Evaluation Test
```python
from eval_protocol.models import EvaluationRow, EvaluateResult
from eval_protocol.pytest.evaluation_test import evaluation_test
from eval_protocol.pytest.default_single_turn_rollout_process import SingleTurnRolloutProcessor


@evaluation_test(
    input_dataset=["datasets/my_dataset.jsonl"],
    dataset_adapter=my_adapter,
    completion_params=[{
        "model": "fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct",
        "max_tokens": 4096,
    }],
    rollout_processor=SingleTurnRolloutProcessor(),
    mode="pointwise",
)
def test_my_task(row: EvaluationRow) -> EvaluationRow:
    predicted = row.get_last_message_content()
    expected = row.ground_truth

    is_correct = predicted.strip() == expected.strip()

    # Feedback tells GEPA why this example failed
    if is_correct:
        feedback = "Correct answer."
    else:
        feedback = f"Incorrect. Expected '{expected}', got '{predicted}'."

    row.evaluation_result = EvaluateResult(
        score=1.0 if is_correct else 0.0,
        reason=feedback,
        is_score_valid=True,
    )
    return row
```
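The decorator above references a dataset adapter (`my_adapter`) that is not shown. The sketch below is one hypothetical shape for it; the exact signature `dataset_adapter` expects can differ between eval-protocol versions, and the `Message` import is an assumption:

```python
# Hypothetical adapter sketch: adjust to your eval-protocol version and schema.
from eval_protocol.models import EvaluationRow, Message  # Message import is an assumption

def my_adapter(raw_row: dict) -> EvaluationRow:
    """Map one parsed JSONL record, e.g.
    {"messages": [{"role": "system", "content": "..."},
                  {"role": "user", "content": "..."}],
     "ground_truth": "..."}
    to an EvaluationRow."""
    return EvaluationRow(
        messages=[Message(**m) for m in raw_row["messages"]],
        ground_truth=raw_row["ground_truth"],
    )
```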
Step 2: Add GEPA Training
```python
from eval_protocol.training import GEPATrainer, build_reflection_lm

if __name__ == "__main__":
    trainer = GEPATrainer(
        test_my_task,
        train_ratio=0.7,
        val_ratio=0.3,
    )

    reflection_lm = build_reflection_lm(
        "fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct"
    )

    optimized_program = trainer.train(
        reflection_lm=reflection_lm,
        max_metric_calls=1500,
        num_threads=4,
    )

    print(trainer.evaluate(optimized_program))
    print(trainer.get_optimized_system_prompt(optimized_program))
```
Step 3: Run
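Assuming the evaluation test and the training block live in a single script (hypothetically named `test_my_task.py` here), run it directly with Python so the `__main__` block executes:

```bash
python test_my_task.py
```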
Single system prompt requirement: GEPA extracts and optimizes only the first system prompt found in your dataset. All rows must share the same system prompt for GEPA to work correctly. If your dataset contains different system prompts per row (e.g., different personas or task variations), GEPA will only optimize the first one and apply it to all examples, which may produce unexpected results. Consider splitting such datasets into separate optimization runs.
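If you are unsure whether your dataset meets this requirement, a quick check like the following can help. It assumes each JSONL row stores its chat messages under a `messages` key; adjust to your schema:

```python
import json

def distinct_system_prompts(path: str) -> set[str]:
    """Collect every distinct system prompt found in a JSONL dataset."""
    prompts = set()
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            prompts.update(
                m["content"]
                for m in row.get("messages", [])
                if m.get("role") == "system"
            )
    return prompts

# Expect exactly one entry before running GEPA
print(distinct_system_prompts("datasets/my_dataset.jsonl"))
```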
Case Study: Text-to-SQL
The text-to-sql-quickstart repository demonstrates GEPA on a text-to-SQL benchmark.
Repository Structure
```
text-to-sql-quickstart/
├── data/
│   └── synthetic_openflights.db        # DuckDB database with airlines/airports/routes
├── datasets/
│   ├── final_rft_sql_train_data.jsonl  # Training examples
│   └── final_rft_sql_test_data.jsonl   # Held-out test set
├── mcp_server/                         # HTTP server that executes SQL queries
├── evaluator/
│   └── sql_gepa_training.py            # GEPA training script
└── scripts/
    └── eval_baseline.py                # Evaluate prompts on test set
```
The Database
The benchmark uses a synthetic OpenFlights database (synthetic_openflights.db) containing tables for airlines, airports, countries, planes, and routes. This database is shared between:
- Ground truth generation (SQL queries executed to create expected results)
- The MCP server (executes model-generated SQL during evaluation)
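If you want to inspect the database locally, a quick DuckDB session (illustrative only, not part of the repository's scripts) looks like this:

```python
import duckdb

con = duckdb.connect("data/synthetic_openflights.db", read_only=True)
print(con.execute("SHOW TABLES").fetchall())            # airlines, airports, countries, planes, routes
print(con.execute("SELECT COUNT(*) FROM routes").fetchone())
con.close()
```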
The MCP Server
The MCP server is a simple HTTP service that accepts SQL queries and returns results from the DuckDB database:
```python
# mcp_server/run_mcp_server.py
DB = os.environ.get("DB_PATH", "data/synthetic_openflights.db")
```
When you run the training script, the MCP server starts automatically. It receives the model’s generated SQL, executes it against the database, and returns the results for comparison with ground truth.
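The repository's server code is not reproduced in full here. The sketch below shows the same idea with Flask and a hypothetical `/query` endpoint; both are illustrative choices, not the quickstart's actual API:

```python
# Minimal sketch of an SQL-over-HTTP service backed by DuckDB.
import os
import duckdb
from flask import Flask, jsonify, request

DB = os.environ.get("DB_PATH", "data/synthetic_openflights.db")
app = Flask(__name__)

@app.post("/query")  # hypothetical endpoint
def run_query():
    sql = request.get_json()["sql"]
    con = duckdb.connect(DB, read_only=True)
    try:
        cur = con.execute(sql)
        cols = [c[0] for c in cur.description]
        rows = [dict(zip(cols, r)) for r in cur.fetchall()]
        return jsonify({"rows": rows})
    except Exception as exc:  # surface SQL errors to the evaluator
        return jsonify({"error": str(exc)}), 400
    finally:
        con.close()

if __name__ == "__main__":
    app.run(port=8000)
```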
The Evaluation Function
The evaluation compares the model’s SQL output against ground truth by executing both and checking if they return the same data:
```python
def test_sql_generation(row: EvaluationRow) -> EvaluationRow:
    generated_sql = row.get_last_message_content()

    # Ground truth is stored as list[dict] from the original query
    expected_results = row.ground_truth

    # Execute the model's SQL via MCP server
    actual_results = execute_sql_via_mcp(generated_sql)

    # Compare results semantically (values match, ignoring column order)
    is_correct = compare_results(expected_results, actual_results)

    # Build detailed feedback for GEPA
    if is_correct:
        feedback = "Query returned correct results."
    else:
        feedback = analyze_mismatch(expected_results, actual_results)

    row.evaluation_result = EvaluateResult(
        score=1.0 if is_correct else 0.0,
        reason=feedback,
        is_score_valid=True,
    )
    return row
```
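`execute_sql_via_mcp` and `compare_results` are helpers from the repository (the former sends the generated SQL to the local MCP server) and are not shown in this excerpt. A rough sketch of what the comparison might look like, under the "same values, any column order" rule described above, is:

```python
def compare_results(expected, actual):
    """Sketch only: treat result sets as equal when they contain the same rows,
    ignoring row order and column order (column names still matter)."""
    if actual is None or len(expected) != len(actual):
        return False

    def normalize(rows):
        # Sort each row's (column, value) pairs, then sort the rows themselves
        return sorted(tuple(sorted((k, str(v)) for k, v in row.items())) for row in rows)

    return normalize(expected) == normalize(actual)
```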
The feedback function provides specific details about failures:
```python
def analyze_mismatch(expected, actual):
    issues = []

    if len(expected) != len(actual):
        issues.append(f"Row count: expected {len(expected)}, got {len(actual)}")

    expected_cols = set(expected[0].keys()) if expected else set()
    actual_cols = set(actual[0].keys()) if actual else set()
    missing = expected_cols - actual_cols
    if missing:
        issues.append(f"Missing columns: {missing}")

    return " | ".join(issues)
```
Running the Example
1. Clone the repository and install dependencies:

   ```bash
   git clone https://github.com/eval-protocol/text-to-sql-quickstart
   cd text-to-sql-quickstart
   pip install -r requirements.txt
   ```

2. Set your API key:

   ```bash
   export FIREWORKS_API_KEY="your-key"
   ```

3. The repository includes pre-generated data. To run GEPA training:

   ```bash
   python evaluator/sql_gepa_training.py
   ```

   This starts the MCP server automatically, runs GEPA optimization, and prints the optimized prompt.

4. To compare the original and optimized prompts on the test set:

   ```bash
   python scripts/eval_baseline.py --prompt both
   ```
Data Generation (Optional)
If you want to regenerate the synthetic data from scratch, the repository includes a data-generation pipeline that:
- Download real OpenFlights data
- Generate synthetic rows using an LLM
- Generate SQL queries
- Execute queries to get ground truth results
- Generate natural language questions from SQL
The scripts/08_regenerate_balanced_data.py script generates data with consistent column naming for better train/test distribution.
Results
On this benchmark, GEPA discovered that failures clustered around column alias mismatches (avg_altitude vs average_altitude) and missing columns. It rewrote the prompt to include explicit naming conventions and a validation checklist.
| Metric | Original | Optimized |
|---|---|---|
| Test Set | 38.3% | 48.3% |
| Validation | 30.9% | 36.4% |
Configuration
GEPATrainer Parameters
- The `@evaluation_test`-decorated function to optimize (passed as the first positional argument)
- `train_ratio`: proportion of data used for training
- `val_ratio`: proportion of data used for validation
- Random seed for the train/validation split
- Name of the input field in the DSPy signature
- Name of the output field in the DSPy signature
- `module_type` (`DSPyModuleType`, default `CHAIN_OF_THOUGHT`): DSPy module type, one of `PREDICT`, `CHAIN_OF_THOUGHT`, or `PROGRAM_OF_THOUGHT`
train() Parameters
- `reflection_lm`: DSPy LM for proposing prompt improvements
- `max_metric_calls`: total budget of LLM calls for optimization
- `auto`: budget preset ("light", "medium", or "heavy"); an alternative to `max_metric_calls`
- `reflection_minibatch_size`: number of examples shown to the reflection LLM per iteration
- `num_threads`: parallel threads for running evaluations
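As a usage sketch of these options, a small dataset might swap `max_metric_calls` for the `auto` preset (the specific values here are illustrative):

```python
optimized_program = trainer.train(
    reflection_lm=reflection_lm,
    auto="light",                 # budget preset instead of max_metric_calls
    reflection_minibatch_size=8,  # examples shown to the reflection LLM per iteration
    num_threads=4,
)
```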
Tips
- Provide specific feedback. GEPA learns from your `evaluation_result.reason`. Instead of "Incorrect", say "Missing column 'airport_count'".
- Choose an appropriate budget. For small datasets (under 50 examples), use `auto="light"`. For larger datasets, increase to `"medium"` or `"heavy"`.
- Use the right module type: `PREDICT` for simple tasks, `CHAIN_OF_THOUGHT` for reasoning tasks, `PROGRAM_OF_THOUGHT` for code generation.
- Keep a held-out test set. GEPA should never see your final test data during optimization.
Troubleshooting
- GEPA finds no improvement: add more detailed feedback, increase `reflection_minibatch_size`, or increase the budget.
- API timeouts: reduce `num_threads` or use a faster model.
- Memory issues: reduce `num_threads` or process smaller batches.
- Dataset has multiple system prompts: GEPA only optimizes the first system prompt found. If your dataset uses different prompts for different tasks, split it into separate datasets with consistent prompts and run GEPA on each.
Resources
- GEPA Paper
- DSPy Documentation
- Text-to-SQL Quickstart