To make it easy to build a model leaderboard for LangGraph apps, eval-protocol provides an out-of-the-box rollout processor for LangGraph.

LangGraphRolloutProcessor

LangGraphRolloutProcessor orchestrates rollouts for LangGraph apps: you pass a graph factory function, and eval-protocol handles running your experiments against your dataset. The factory accepts a typed RolloutProcessorConfig.
@evaluation_test(
    input_dataset=["examples/langgraph/data/simple_prompts.jsonl"],
    dataset_adapter=adapter,
    completion_params=[
        {
            "model": "accounts/fireworks/models/kimi-k2-instruct",
            "temperature": 0.0,
        },
    ],
    rollout_processor=processor,
)
async def test_langgraph_pointwise(row: EvaluationRow) -> EvaluationRow:
    ...
See the LangGraphRolloutProcessor reference for details.

Graph Factory

To supply a LangGraph app for evaluation, define a factory function that accepts a RolloutProcessorConfig and returns a compiled graph that exposes .ainvoke. In this example, we assume you have a build_simple_graph function that builds a LangGraph app for a given model.
from typing import Any
from eval_protocol.pytest.types import RolloutProcessorConfig
from eval_protocol.pytest.default_langchain_rollout_processor import LangGraphRolloutProcessor
from examples.langgraph.simple_graph import build_simple_graph


def graph_factory(config: RolloutProcessorConfig) -> Any:
    cp = config.completion_params or {}
    model = cp.get("model") or "accounts/fireworks/models/kimi-k2-instruct"
    temperature = cp.get("temperature", 0.0)
    return build_simple_graph(model=model, model_provider="fireworks", temperature=temperature)


processor = LangGraphRolloutProcessor(graph_factory=graph_factory)
Use completion_params on the RolloutProcessorConfig to read the model name and other parameters, then construct your LangGraph app accordingly.

Simple Graph Example

See example code in the repository: test file and graph.
Our simple LangGraph app uses LangChain-native messages and a single node that calls the configured model.
from typing import Any, Dict, List
from typing_extensions import TypedDict, Annotated


def build_simple_graph(
    model: str = "accounts/fireworks/models/kimi-k2-instruct",
    *,
    model_provider: str = "fireworks",
    temperature: float = 0.0,
) -> Any:
    from langgraph.graph import StateGraph, END
    from langgraph.graph.message import add_messages
    from langchain_core.messages import BaseMessage
    from langchain.chat_models import init_chat_model

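    # State is the chat history; the add_messages reducer appends each new message to it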
    class State(TypedDict):
        messages: Annotated[List[BaseMessage], add_messages]

    llm = init_chat_model(model, model_provider=model_provider, temperature=temperature)

    async def call_model(state: State, **_: Any) -> Dict[str, Any]:
        messages: List[BaseMessage] = state.get("messages", [])
        resp = await llm.ainvoke(messages)
        return {"messages": [resp]}

    g = StateGraph(State)
    g.add_node("call_model", call_model)
    g.set_entry_point("call_model")
    g.add_edge("call_model", END)
    return g.compile()

Writing the Eval

Every eval in eval-protocol expects an input dataset of type List[EvaluationRow]. For this example, a small JSONL dataset of prompts is used and adapted into EvaluationRows via an adapter function. The rollout processor handles converting EvaluationRow.messages to LangChain messages and applies the model output back to the row.
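A minimal adapter might look like the sketch below. It assumes the adapter receives the parsed JSONL rows as dictionaries with a single prompt field, and that EvaluationRow and Message can be imported from eval_protocol.models; adjust the field names and import paths to match your dataset and eval-protocol version. The resulting function is what the decorators above reference as dataset_adapter=adapter.
from typing import Any, Dict, List

from eval_protocol.models import EvaluationRow, Message  # assumed import path


def adapter(rows: List[Dict[str, Any]]) -> List[EvaluationRow]:
    # Each JSONL record is assumed to look like {"prompt": "..."}
    return [
        EvaluationRow(messages=[Message(role="user", content=record["prompt"])])
        for record in rows
    ]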

Generating a Score

Evals in eval-protocol return a score between 0.0 and 1.0. This simple example scores whether the assistant replied.
# Body of the test function decorated with @evaluation_test above
async def test_langgraph_pointwise(row: EvaluationRow) -> EvaluationRow:
    has_reply = 1.0 if any(m.role == "assistant" for m in (row.messages or [])) else 0.0
    row.evaluation_result = EvaluateResult(
        score=has_reply,
        reason="assistant replied" if has_reply else "no assistant reply",
    )
    return row

Reasoning Model Example

You can also evaluate reasoning models like gpt-oss-120b and control how much they reason via reasoning_effort.
Example code: test file and graph.
from typing import Any, Dict, List
from typing_extensions import Annotated, TypedDict


def build_reasoning_graph(
    *,
    model: str = "accounts/fireworks/models/gpt-oss-120b",
    model_provider: str = "fireworks",
    temperature: float = 0.0,
    reasoning_effort: str | None = None,
) -> Any:
    from langgraph.graph import StateGraph, END
    from langgraph.graph.message import add_messages
    from langchain.chat_models import init_chat_model
    from langchain_core.messages import BaseMessage

    class State(TypedDict):
        messages: Annotated[List[BaseMessage], add_messages]

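    # Extra kwargs such as reasoning_effort ("low" | "medium" | "high") are passed through to the underlying chat model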
    llm = init_chat_model(
        model,
        model_provider=model_provider,
        temperature=temperature,
        reasoning_effort=reasoning_effort,
    )

    async def call_model(state: State) -> Dict[str, Any]:
        response = await llm.ainvoke(state["messages"])  # type: ignore[assignment]
        return {"messages": [response]}

    g = StateGraph(State)
    g.add_node("call_model", call_model)
    g.set_entry_point("call_model")
    g.add_edge("call_model", END)
    return g.compile()

Passing reasoning_effort

Use completion_params to pass reasoning_effort values like “low”, “medium”, or “high”.
@evaluation_test(
    input_dataset=["examples/langgraph/data/simple_prompts.jsonl"],
    dataset_adapter=adapter,
    rollout_processor=processor,
    completion_params=[
        {"model": "accounts/fireworks/models/gpt-oss-120b", "temperature": 0.0, "reasoning_effort": "low"}
    ],
    mode="pointwise",
)
async def test_langgraph_reasoning(row: EvaluationRow) -> EvaluationRow:  # illustrative name
    ...
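The processor passed as rollout_processor above comes from a graph factory that reads reasoning_effort out of completion_params and forwards it to the graph builder. A minimal sketch (the factory name is illustrative), assuming the build_reasoning_graph shown above:
def reasoning_graph_factory(config: RolloutProcessorConfig) -> Any:
    cp = config.completion_params or {}
    return build_reasoning_graph(
        model=cp.get("model") or "accounts/fireworks/models/gpt-oss-120b",
        model_provider="fireworks",
        temperature=cp.get("temperature", 0.0),
        reasoning_effort=cp.get("reasoning_effort"),
    )


processor = LangGraphRolloutProcessor(graph_factory=reasoning_graph_factory)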

Running the Evaluation

export FIREWORKS_API_KEY=... && \
pytest python-sdk/examples/langgraph/test_langgraph_rollout.py -v
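To run the rollouts for just one model, you can narrow the run with pytest's -k filter, assuming the model name appears in the generated test ID:
pytest python-sdk/examples/langgraph/test_langgraph_rollout.py -v -k kimi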

Creating a Leaderboard

To compare different models, add multiple entries to completion_params and set num_runs so each configuration is evaluated several times, which gives more robust results.
# Assuming `processor = LangGraphRolloutProcessor(graph_factory=graph_factory)` as shown above
@evaluation_test(
    input_dataset=["examples/langgraph/data/simple_prompts.jsonl"],
    dataset_adapter=adapter,
    completion_params=[
        {"model": "accounts/fireworks/models/kimi-k2-instruct", "temperature": 0.0},
        {"model": "accounts/fireworks/models/kimi-k2-instruct-0905", "temperature": 0.0},
        {"model": "accounts/fireworks/models/qwen3-235b-a22b-instruct-2507", "temperature": 0.0},
    ],
    num_runs=3,
    rollout_processor=processor,
)
async def test_langgraph_pointwise(row: EvaluationRow) -> EvaluationRow:
    ...  # same scoring logic as in the simple example above
After running the evaluation, analyze results using the Pivot View.
Example leaderboard in the Pivot View, comparing models by the average of $.evaluation_result.score.