eval-protocol provides an out-of-the-box rollout processor for LangGraph.
LangGraphRolloutProcessor
This orchestrates rollouts for LangGraph apps so you only need to pass a graph factory function, and eval-protocol will handle running your experiments against your dataset. The factory accepts a typed RolloutProcessorConfig.
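As a rough sketch of how the processor plugs into an eval, the wiring might look like the following. The import path, the decorator parameter names (input_dataset, rollout_processor), and passing the factory to the LangGraphRolloutProcessor constructor are assumptions here, so check them against your installed eval-protocol version.

```python
# Sketch only: import paths and parameter names are assumptions, not a verified API.
from eval_protocol.pytest import evaluation_test, LangGraphRolloutProcessor

@evaluation_test(
    input_dataset=["tests/data/prompts.jsonl"],              # hypothetical dataset path
    completion_params=[{"model": "gpt-4o-mini"}],            # placeholder model name
    rollout_processor=LangGraphRolloutProcessor(my_graph_factory),  # factory defined below
)
async def test_langgraph_app(row):
    # Scoring logic goes here; see "Generating a Score" below.
    ...
```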
Graph Factory
To supply a LangGraph app for evaluation, define a factory function that accepts a RolloutProcessorConfig and returns a compiled graph with .ainvoke. In this example, we assume you have a build_simple_graph function that creates a LangGraph app using a given model. Use completion_params in the RolloutProcessorConfig to get the model name and other parameters, then construct your LangGraph app accordingly.
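A minimal factory might look like the sketch below. It assumes completion_params behaves like a dict with a "model" entry and that RolloutProcessorConfig is importable as shown; adjust both to match your installed version.

```python
from eval_protocol.pytest.types import RolloutProcessorConfig  # import path is an assumption

def my_graph_factory(config: RolloutProcessorConfig):
    # Pull the model name (and any other generation settings) from completion_params.
    # Treating completion_params as a dict with a "model" key is an assumption.
    params = config.completion_params or {}
    model_name = params.get("model", "gpt-4o-mini")  # fallback model name is illustrative
    # build_simple_graph returns a compiled LangGraph app exposing .ainvoke (defined below).
    return build_simple_graph(model_name)
```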
Simple Graph Example
Our simple LangGraph app uses LangChain-native messages and a single node that calls the configured model.
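A sketch of such a graph is shown below. The name build_simple_graph comes from the example above; the body is illustrative and assumes an OpenAI-compatible chat model reachable via LangChain's init_chat_model.

```python
from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, MessagesState, START, END

def build_simple_graph(model_name: str):
    """Compile a one-node LangGraph app that calls the configured chat model."""
    model = init_chat_model(model_name)

    def call_model(state: MessagesState):
        # Invoke the model on the conversation so far and append its reply.
        response = model.invoke(state["messages"])
        return {"messages": [response]}

    graph = StateGraph(MessagesState)
    graph.add_node("call_model", call_model)
    graph.add_edge(START, "call_model")
    graph.add_edge("call_model", END)
    return graph.compile()  # the compiled graph exposes .ainvoke
```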
Writing the Eval
Every eval in eval-protocol expects an input dataset of type List[EvaluationRow]. For this example, a small JSONL dataset of prompts is adapted into EvaluationRows via an adapter function. The rollout processor handles converting EvaluationRow.messages to LangChain messages and applies the model output back to the row.
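One possible adapter is sketched below. It assumes each JSONL record carries a "prompt" field and that EvaluationRow and Message are importable from eval_protocol.models; both the field name and the import path are assumptions to adapt to your dataset and version.

```python
from typing import Any, Dict, List

from eval_protocol.models import EvaluationRow, Message  # import path assumed

def prompts_to_evaluation_rows(records: List[Dict[str, Any]]) -> List[EvaluationRow]:
    """Turn raw JSONL records like {"prompt": "..."} into EvaluationRows."""
    # The "prompt" field name is an assumption about the dataset's shape.
    return [
        EvaluationRow(messages=[Message(role="user", content=record["prompt"])])
        for record in records
    ]
```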
Generating a Score
Evals in eval-protocol return a score between 0.0 and 1.0. This simple example scores whether the assistant replied.
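Continuing the earlier sketch, the body of the decorated test might score the row like this. EvaluateResult and the evaluation_result field are taken from eval_protocol.models as I understand them; treat the exact class and attribute names as assumptions.

```python
from eval_protocol.models import EvaluateResult, EvaluationRow  # import path assumed

async def test_langgraph_app(row: EvaluationRow) -> EvaluationRow:
    # After the rollout, row.messages contains the conversation including the model's reply.
    replied = any(msg.role == "assistant" and msg.content for msg in row.messages)
    # Attach a score between 0.0 and 1.0; class and field names assumed from eval_protocol.models.
    row.evaluation_result = EvaluateResult(
        score=1.0 if replied else 0.0,
        reason="assistant replied" if replied else "no assistant reply",
    )
    return row
```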
Reasoning Model Example
You can also evaluate reasoning models like gpt-oss-120b and control reasoning via reasoning_effort.
Passing reasoning_effort
Use completion_params to pass reasoning_effort values like “low”, “medium”, or “high”.
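One plausible shape for those entries is shown below; the model identifier is a placeholder, and whether reasoning_effort sits at the top level of each completion_params entry may depend on your provider.

```python
# Placeholder model identifier; reasoning_effort is forwarded with the completion params.
completion_params = [
    {"model": "gpt-oss-120b", "reasoning_effort": "low"},
    {"model": "gpt-oss-120b", "reasoning_effort": "high"},
]
```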
Running the Evaluation
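Evals built with eval-protocol run as ordinary pytest tests, so the example above can be executed by pointing pytest at the file that contains it, for instance pytest test_langgraph_eval.py (the file name is whatever you chose for the eval).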
Creating a Leaderboard
To compare different models, add multiple entries to completion_params and set num_runs to get robust evaluation across runs.
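Extending the earlier wiring sketch (and reusing its imports), a leaderboard configuration could look like the following; the model identifiers are placeholders and the decorator parameter names carry the same assumptions as before.

```python
@evaluation_test(
    input_dataset=["tests/data/prompts.jsonl"],   # hypothetical dataset path
    completion_params=[
        {"model": "gpt-oss-120b", "reasoning_effort": "high"},  # placeholder model IDs
        {"model": "gpt-4o-mini"},
        {"model": "llama-v3p1-70b-instruct"},
    ],
    num_runs=3,  # repeat each configuration to smooth out run-to-run noise
    rollout_processor=LangGraphRolloutProcessor(my_graph_factory),
)
async def test_model_leaderboard(row):
    ...
```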

Example leaderboard showing model performance comparison in the Pivot View.