The integration is built around your Eval Protocol `@evaluation_test`: rLLM uses Eval Protocol's rollout processor to generate trajectories, calls the same evaluation function you use for offline evals, and converts the result into rLLM's abstractions. This makes it easy to start with rLLM and later move to other Eval Protocol-supported training workflows (or vice versa) without rewriting your evals.
For an end-to-end example, see the FrozenLake Eval Protocol example.
High-Level Overview
The core integration lives in rLLM's `EvalProtocolWorkflow` (implemented in `rllm/workflows/eval_protocol_workflow.py`):
`EvalProtocolWorkflow`:
- Takes an Eval Protocol `@evaluation_test` (found via its module path, e.g. `"eval_protocol.benchmarks.test_frozen_lake"`).
- Reads the test's metadata (attached by `@evaluation_test`), including:
  - `rollout_processor` (e.g., `MCPGymRolloutProcessor`)
  - `server_script_path` / `mcp_config_path`
  - rollout kwargs, mode, etc.
- Builds a rollout config combining:
  - Eval Protocol metadata, and
  - rLLM's config (model ID, temperature, max tokens, number of steps).
- Runs rollouts through Eval Protocol's `rollout_processor`, then calls the evaluation function (your `@evaluation_test`) to produce an `EvaluationRow` with an `evaluation_result`.
- Converts the resulting `EvaluationRow` into an rLLM `Episode`/`Trajectory`/`Step`, attaching the final score and metrics.
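Conceptually, that final conversion step looks something like the sketch below. `Episode`, `Trajectory`, and `Step` are defined here as simplified stand-ins so the example runs on its own; they are not rLLM's real classes, and only the names and the "attach the final score" behavior come from this page:

```python
# Conceptual sketch of the EvaluationRow -> Episode conversion.
# Episode / Trajectory / Step are simplified stand-ins, NOT rLLM's real classes.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:  # stand-in for rLLM's Step
    chat_completions: list[dict[str, Any]]  # messages produced at this turn


@dataclass
class Trajectory:  # stand-in for rLLM's Trajectory
    steps: list[Step]
    reward: float


@dataclass
class Episode:  # stand-in for rLLM's Episode
    id: str
    trajectories: list[Trajectory]
    metrics: dict[str, Any] = field(default_factory=dict)


def row_to_episode(row_id: str, messages: list[dict[str, Any]], score: float) -> Episode:
    """Wrap one rollout's messages in a single-step trajectory and attach its score."""
    step = Step(chat_completions=messages)
    return Episode(
        id=row_id,
        trajectories=[Trajectory(steps=[step], reward=score)],
        metrics={"evaluation_result": score},
    )
```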
Basic Usage
1. Define an Eval Protocol `@evaluation_test`
Start with a normal Eval Protocol test. For example, a FrozenLake environment that uses an MCP rollout processor:
test_frozen_lake.py
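Below is a minimal sketch of what such a test can look like. Only the names mentioned on this page (`@evaluation_test`, `MCPGymRolloutProcessor`, `server_script_path`, `test_frozen_lake_evaluation`, `EvaluationRow`, `evaluation_result`) are taken from the docs; the import locations, dataset path, completion params, and scoring heuristic are illustrative assumptions, not the benchmark's actual code:

```python
# A minimal sketch of test_frozen_lake.py. Import locations, the dataset path,
# completion params, and the scoring heuristic are illustrative assumptions.
from eval_protocol.models import EvaluateResult, EvaluationRow
from eval_protocol.pytest import evaluation_test
from eval_protocol.pytest import MCPGymRolloutProcessor  # exact module may differ


@evaluation_test(
    input_dataset=["data/frozen_lake_dataset.jsonl"],  # illustrative path
    completion_params=[{"model": "gpt-4o-mini"}],      # illustrative model
    rollout_processor=MCPGymRolloutProcessor(),        # how to roll out
    server_script_path="frozen_lake_mcp/server.py",    # illustrative path
    mode="pointwise",
)
def test_frozen_lake_evaluation(row: EvaluationRow) -> EvaluationRow:
    """Score 1.0 if the final observation says the agent reached the goal."""
    last = row.messages[-1].content if row.messages else ""
    reached_goal = isinstance(last, str) and "goal" in last.lower()
    row.evaluation_result = EvaluateResult(score=1.0 if reached_goal else 0.0)
    return row
```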
The test declares both how to roll out (via the `rollout_processor`) and how to score (via the body of `test_frozen_lake_evaluation`).
2. Prepare a dataset for rLLM
On the rLLM side, you typically build a small dataset of task dicts that `EvalProtocolWorkflow` can map into `EvaluationRow`s. For FrozenLake, rLLM uses a script like:
prepare_frozen_lake_data.py
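A sketch of what such a script might produce, assuming JSON output and the task fields listed below; the prompts, seeds, output file name, and `environment_context` contents are illustrative:

```python
# Sketch of a dataset-prep script. The JSON output format, prompts, and
# environment_context fields are assumptions; match whatever your test expects.
import json

tasks = [
    {
        "id": f"frozen_lake_{i}",
        "system_prompt": "You are playing FrozenLake. Reach the goal without falling into a hole.",
        "user_prompt_template": "Current board:\n{observation}\nChoose your next move.",
        "environment_context": {"game": "FrozenLake", "seed": i},
    }
    for i in range(8)
]

with open("frozen_lake_tasks.json", "w") as f:
    json.dump(tasks, f, indent=2)
```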
Each task dict provides:
- `id`
- `system_prompt`
- `user_prompt_template` (e.g., uses `{observation}`)
- `environment_context` (whatever your Eval Protocol test expects)

Each task is converted into an `EvaluationRow` by `EvalProtocolWorkflow`'s `_task_to_evaluation_row`.
3. Run Eval Protocol tests through AgentWorkflowEngine
To run evals (no training), rLLM uses `AgentWorkflowEngine` with `EvalProtocolWorkflow`:
run_frozen_lake_flow.py
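A hedged sketch of such a runner: `workflow_cls=EvalProtocolWorkflow` and `env_path` come from this page, but the `AgentWorkflowEngine` import path, its other constructor arguments, and the `execute_tasks` entry point are assumptions:

```python
# Sketch of an eval runner. workflow_cls and env_path come from this page;
# the AgentWorkflowEngine import path, its other constructor arguments, and
# the execute_tasks entry point are assumptions.
import json

from rllm.engine.agent_workflow_engine import AgentWorkflowEngine  # assumed path
from rllm.workflows.eval_protocol_workflow import EvalProtocolWorkflow

with open("frozen_lake_tasks.json") as f:
    tasks = json.load(f)

engine = AgentWorkflowEngine(
    workflow_cls=EvalProtocolWorkflow,
    workflow_args={"env_path": "eval_protocol.benchmarks.test_frozen_lake"},
)
episodes = engine.execute_tasks(tasks)  # assumed synchronous entry point
for episode in episodes:
    print(episode)
```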
`workflow_cls=EvalProtocolWorkflow` tells rLLM to use the Eval Protocol adapter, and `env_path="eval_protocol.benchmarks.test_frozen_lake"` points to the module containing your `@evaluation_test`. `EvalProtocolWorkflow` imports that module, finds the decorated test with its metadata, and wires everything together.
4. Train with AgentTrainer + EvalProtocolWorkflow
For reinforcement learning, rLLM plugs the same workflow into its trainer:
train_frozen_lake_flow.py
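A hedged sketch of the training script: only `EvalProtocolWorkflow` and `AgentTrainer` are named on this page, while the import paths, keyword arguments, and dataset handling are assumptions:

```python
# Sketch of a training script. Only EvalProtocolWorkflow and AgentTrainer are
# named on this page; import paths, kwargs, and dataset handling are assumptions.
from rllm.trainer.agent_trainer import AgentTrainer  # assumed path
from rllm.workflows.eval_protocol_workflow import EvalProtocolWorkflow

trainer = AgentTrainer(
    workflow_class=EvalProtocolWorkflow,                                     # assumed kwarg
    workflow_args={"env_path": "eval_protocol.benchmarks.test_frozen_lake"},
    train_dataset="frozen_lake_tasks.json",                                  # assumed kwarg
)
trainer.train()  # collects Episodes from EP rollouts and feeds the PPO/GRPO trainer
```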
`AgentTrainer`:
- Uses `EvalProtocolWorkflow` as its sampler/workflow.
- Collects `Episode`s from Eval Protocol rollouts.
- Uses those `Episode`s as input to the underlying PPO/GRPO trainer.
End-to-End FrozenLake Example
To see this in action:
- Clone the rLLM repository.
- Prepare the FrozenLake Eval Protocol dataset (see `prepare_frozen_lake_data.py` above).
- Run the FrozenLake Eval Protocol workflow through rLLM (`run_frozen_lake_flow.py`).
- Start training (`train_frozen_lake_flow.py`).
To adapt this to your own environment:
- Change `env_path` to the module containing your `@evaluation_test`.
- Prepare a matching dataset for rLLM (`id`, `system_prompt`, `user_prompt_template`, `environment_context`).
- Reuse `EvalProtocolWorkflow` with `AgentWorkflowEngine` and/or `AgentTrainer` to run or train on that environment.

