Eval Protocol integrates with GEPA (via DSPy) to automatically optimize your prompts using the evaluations you’ve already written. GEPA analyzes which examples pass or fail, proposes structured edits to the prompt, and keeps changes that improve your metric.

How It Works

GEPA treats your @evaluation_test as the optimization objective. It:
  1. Extracts the system prompt from your dataset
  2. Splits your data into training and validation sets
  3. Runs your evaluation function on candidate prompts
  4. Uses a reflection LLM to propose improvements based on failure patterns
  5. Returns the best-performing prompt
The key insight is that your evaluation’s reason field (in EvaluateResult) tells GEPA why examples failed, enabling targeted improvements.
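For example, two failing results with the same score can differ a lot in how useful they are: only the second reason below gives the reflection LLM something concrete to act on (a minimal illustration using the same EvaluateResult fields as the examples later in this guide).
from eval_protocol.models import EvaluateResult

# Vague: GEPA only learns that the example failed.
weak = EvaluateResult(score=0.0, reason="Incorrect.", is_score_valid=True)

# Specific: GEPA can see what the prompt needs to fix.
strong = EvaluateResult(
    score=0.0,
    reason="Incorrect. Expected column 'airport_count', got 'count'.",
    is_score_valid=True,
)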

Prerequisites

Install eval-protocol:
pip install eval-protocol
Set your API key:
export FIREWORKS_API_KEY="your-fireworks-key"

Basic Usage

Write a normal @evaluation_test, then wrap it with GEPATrainer.

Step 1: Define Your Evaluation Test

my_eval.py
from eval_protocol.models import EvaluationRow, EvaluateResult
from eval_protocol.pytest.evaluation_test import evaluation_test
from eval_protocol.pytest.default_single_turn_rollout_process import SingleTurnRolloutProcessor

@evaluation_test(
    input_dataset=["datasets/my_dataset.jsonl"],
    dataset_adapter=my_adapter,
    completion_params=[{
        "model": "fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct",
        "max_tokens": 4096,
    }],
    rollout_processor=SingleTurnRolloutProcessor(),
    mode="pointwise",
)
def test_my_task(row: EvaluationRow) -> EvaluationRow:
    predicted = row.get_last_message_content()
    expected = row.ground_truth
    
    is_correct = predicted.strip() == expected.strip()
    
    # Feedback tells GEPA why this example failed
    if is_correct:
        feedback = "Correct answer."
    else:
        feedback = f"Incorrect. Expected '{expected}', got '{predicted}'."
    
    row.evaluation_result = EvaluateResult(
        score=1.0 if is_correct else 0.0,
        reason=feedback,
        is_score_valid=True,
    )
    return row
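The decorator above references my_adapter without defining it. The adapter's exact signature isn't covered in this guide; the hypothetical sketch below assumes it receives the parsed JSONL rows as dicts with system_prompt, question, and answer fields and converts them into EvaluationRow objects. Adjust the field names to match your dataset.
from typing import Any

from eval_protocol.models import EvaluationRow, Message


# Hypothetical adapter sketch: the field names ("system_prompt", "question",
# "answer") are placeholders for whatever your JSONL schema actually uses.
def my_adapter(rows: list[dict[str, Any]]) -> list[EvaluationRow]:
    return [
        EvaluationRow(
            messages=[
                Message(role="system", content=row["system_prompt"]),
                Message(role="user", content=row["question"]),
            ],
            ground_truth=row["answer"],
        )
        for row in rows
    ]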

Step 2: Add GEPA Training

my_eval.py
from eval_protocol.training import GEPATrainer, build_reflection_lm

if __name__ == "__main__":
    trainer = GEPATrainer(
        test_my_task,
        train_ratio=0.7,
        val_ratio=0.3,
    )
    
    reflection_lm = build_reflection_lm(
        "fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct"
    )
    
    optimized_program = trainer.train(
        reflection_lm=reflection_lm,
        max_metric_calls=1500,
        num_threads=4,
    )
    
    print(trainer.evaluate(optimized_program))
    print(trainer.get_optimized_system_prompt(optimized_program))
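You will usually want to keep the result around for later comparison; a small follow-up sketch, reusing the trainer and optimized_program objects from the script above (the file name is arbitrary):
# Persist the optimized prompt so a separate eval script can reuse it.
with open("optimized_system_prompt.txt", "w") as f:
    f.write(trainer.get_optimized_system_prompt(optimized_program))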

Step 3: Run

python my_eval.py
Single System Prompt Requirement: GEPA extracts and optimizes only the first system prompt found in your dataset, so all rows must share the same system prompt for GEPA to work correctly. If your dataset contains different system prompts per row (e.g., different personas or task variations), GEPA will only optimize the first one and apply it to all examples, which may produce unexpected results. Consider splitting such datasets into separate optimization runs.
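A quick way to verify this before training is to check the adapted rows directly. The sketch below assumes the rows have already been converted to EvaluationRow objects whose messages may include a system message:
def assert_single_system_prompt(rows):
    # Collect the system prompt (if any) from each row; they must all match.
    prompts = {
        next((m.content for m in row.messages if m.role == "system"), None)
        for row in rows
    }
    if len(prompts) > 1:
        raise ValueError(
            f"Found {len(prompts)} distinct system prompts; "
            "GEPA only optimizes the first one."
        )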

Case Study: Text-to-SQL

The text-to-sql-quickstart repository demonstrates GEPA on a text-to-SQL benchmark.

Repository Structure

text-to-sql-quickstart/
├── data/
│   └── synthetic_openflights.db    # DuckDB database with airlines/airports/routes
├── datasets/
│   ├── final_rft_sql_train_data.jsonl   # Training examples
│   └── final_rft_sql_test_data.jsonl    # Held-out test set
├── mcp_server/                     # HTTP server that executes SQL queries
├── evaluator/
│   └── sql_gepa_training.py        # GEPA training script
└── scripts/
    └── eval_baseline.py            # Evaluate prompts on test set

The Database

The benchmark uses a synthetic OpenFlights database (synthetic_openflights.db) containing tables for airlines, airports, countries, planes, and routes. This database is shared between:
  1. Ground truth generation (SQL queries executed to create expected results)
  2. The MCP server (executes model-generated SQL during evaluation)
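If you want to inspect the data yourself, the DuckDB Python client can open the file directly; a quick exploration sketch (the table names are those listed above):
import duckdb

# Open the shared database read-only and peek at its contents.
con = duckdb.connect("data/synthetic_openflights.db", read_only=True)
print(con.execute("SHOW TABLES").fetchall())
print(con.execute("SELECT COUNT(*) FROM routes").fetchone())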

The MCP Server

The MCP server is a simple HTTP service that accepts SQL queries and returns results from the DuckDB database:
# mcp_server/run_mcp_server.py
DB = os.environ.get("DB_PATH", "data/synthetic_openflights.db")
When you run the training script, the MCP server starts automatically. It receives the model’s generated SQL, executes it against the database, and returns the results for comparison with ground truth.
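For orientation, a client call to such a server might look roughly like the sketch below. The /query route, the sql payload field, and the port are placeholders, not the actual API defined in mcp_server/run_mcp_server.py; check that file for the real interface.
import requests

# Hypothetical client sketch: endpoint path, payload shape, and port are
# placeholders, not the server's real API.
def execute_sql_via_mcp(sql: str, base_url: str = "http://localhost:8000") -> list[dict]:
    response = requests.post(f"{base_url}/query", json={"sql": sql}, timeout=30)
    response.raise_for_status()
    return response.json()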

The Evaluation Function

The evaluation compares the model’s SQL output against ground truth by executing both and checking if they return the same data:
def test_sql_generation(row: EvaluationRow) -> EvaluationRow:
    generated_sql = row.get_last_message_content()
    
    # Ground truth is stored as list[dict] from the original query
    expected_results = row.ground_truth
    
    # Execute the model's SQL via MCP server
    actual_results = execute_sql_via_mcp(generated_sql)
    
    # Compare results semantically (values match, ignoring column order)
    is_correct = compare_results(expected_results, actual_results)
    
    # Build detailed feedback for GEPA
    if is_correct:
        feedback = "Query returned correct results."
    else:
        feedback = analyze_mismatch(expected_results, actual_results)
    
    row.evaluation_result = EvaluateResult(
        score=1.0 if is_correct else 0.0,
        reason=feedback,
        is_score_valid=True,
    )
    return row
The feedback function provides specific details about failures:
def analyze_mismatch(expected, actual):
    issues = []
    
    if len(expected) != len(actual):
        issues.append(f"Row count: expected {len(expected)}, got {len(actual)}")
    
    expected_cols = set(expected[0].keys()) if expected else set()
    actual_cols = set(actual[0].keys()) if actual else set()
    
    missing = expected_cols - actual_cols
    if missing:
        issues.append(f"Missing columns: {missing}")
    
    # Fall back to a generic message so the reason is never empty
    # (e.g. when the shape matches but the values differ).
    return " | ".join(issues) if issues else "Rows and columns match, but values differ from expected results."

Running the Example

  1. Clone the repository:
git clone https://github.com/eval-protocol/text-to-sql-quickstart
cd text-to-sql-quickstart
pip install -r requirements.txt
  2. Set your API key:
export FIREWORKS_API_KEY="your-key"
  3. The repository includes pre-generated data. To run GEPA training:
python evaluator/sql_gepa_training.py
This starts the MCP server automatically, runs GEPA optimization, and prints the optimized prompt.
  4. To compare original vs optimized prompts on the test set:
python scripts/eval_baseline.py --prompt both

Data Generation (Optional)

If you want to regenerate the synthetic data from scratch:
make all-data
This runs:
  1. Download real OpenFlights data
  2. Generate synthetic rows using an LLM
  3. Generate SQL queries
  4. Execute queries to get ground truth results
  5. Generate natural language questions from SQL
The scripts/08_regenerate_balanced_data.py script generates data with consistent column naming for better train/test distribution.

Results

On this benchmark, GEPA discovered that failures clustered around column alias mismatches (avg_altitude vs average_altitude) and missing columns. It rewrote the prompt to include explicit naming conventions and a validation checklist.
Metric     | Original | Optimized
Test Set   | 38.3%    | 48.3%
Validation | 30.9%    | 36.4%

Configuration

GEPATrainer Parameters

Parameter    | Type           | Default          | Description
test_fn      | TestFunction   | required         | The @evaluation_test decorated function to optimize
train_ratio  | float          | 0.8              | Proportion of data for training
val_ratio    | float          | 0.1              | Proportion of data for validation
seed         | int            | 42               | Random seed for dataset splits
input_field  | str            | "problem"        | Name of the input field in the DSPy signature
output_field | str            | "answer"         | Name of the output field in the DSPy signature
module_type  | DSPyModuleType | CHAIN_OF_THOUGHT | DSPy module type: PREDICT, CHAIN_OF_THOUGHT, or PROGRAM_OF_THOUGHT
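For example, a trainer configured with a larger validation split and custom DSPy field names (the values below are illustrative) could look like:
trainer = GEPATrainer(
    test_my_task,
    train_ratio=0.7,          # 70% of rows used for optimization
    val_ratio=0.3,            # 30% held out for validation during training
    seed=123,                 # reproducible train/val split
    input_field="question",   # input field name in the DSPy signature
    output_field="sql",       # output field name in the DSPy signature
)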

train() Parameters

Parameter                 | Type | Default | Description
reflection_lm             | LM   | -       | DSPy LM for proposing prompt improvements
max_metric_calls          | int  | -       | Total budget of LLM calls for optimization
auto                      | str  | -       | Budget preset: "light", "medium", or "heavy". Alternative to max_metric_calls.
reflection_minibatch_size | int  | 3       | Number of examples shown to the reflection LLM per iteration
num_threads               | int  | -       | Parallel threads for running evaluations
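As an alternative to an explicit max_metric_calls budget, you can pass a preset; a sketch reusing the trainer and reflection_lm from the basic example:
optimized_program = trainer.train(
    reflection_lm=reflection_lm,
    auto="light",                 # budget preset: "light", "medium", or "heavy"
    reflection_minibatch_size=3,  # examples shown to the reflection LLM per iteration
    num_threads=4,                # parallel evaluation threads
)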

Tips

Provide specific feedback. GEPA learns from your evaluation_result.reason. Instead of “Incorrect”, say “Missing column ‘airport_count’”.
Choose an appropriate budget. For small datasets (under 50 examples), use auto="light". For larger datasets, increase to "medium" or "heavy".
Use the right module type. PREDICT for simple tasks, CHAIN_OF_THOUGHT for reasoning tasks, PROGRAM_OF_THOUGHT for code generation.
Keep a held-out test set. GEPA should never see your final test data during optimization.

Troubleshooting

GEPA finds no improvement: Add more detailed feedback, increase reflection_minibatch_size, or increase the budget.
API timeouts: Reduce num_threads or use a faster model.
Memory issues: Reduce num_threads or process smaller batches.
Dataset has multiple system prompts: GEPA only optimizes the first system prompt found. If your dataset uses different prompts for different tasks, split it into separate datasets with consistent prompts and run GEPA on each.

Resources

GEPA Paper · DSPy Documentation · Text-to-SQL Quickstart