The OpenAI RFT adapter lets you reuse Eval Protocol evaluation tests as Python graders for OpenAI Reinforcement Fine-Tuning (RFT). Because your grading logic lives in an Eval Protocol @evaluation_test, you can reuse the exact same code as an OpenAI Python grader—making it easy to start with OpenAI RFT and later move to other Eval Protocol-supported training workflows (or vice versa) without rewriting your evals. For a minimal working example, clone the openai-rft-quickstart repository, which contains the example_rapidfuzz.py and test_openai_grader.py files used in the examples below.

High Level Overview

The core helper function lives in:
from eval_protocol.integrations.openai_rft import build_python_grader_from_evaluation_test
Under the hood, build_python_grader_from_evaluation_test:
  • Takes your Eval Protocol @evaluation_test function that operates on an EvaluationRow.
  • Wraps it into a self-contained {"type": "python", "source": ...} grader module with a grade(sample, item) entrypoint.
  • Builds a minimal EvaluationRow from the OpenAI RFT inputs by:
    • Mapping item["reference_answer"] to row.ground_truth
    • Mapping item["messages"] (if present) to row.messages
    • Mapping sample["output_text"] to the last assistant message
  • Removes any runtime dependency on eval-protocol inside the grader by using simple duck-typed stand-ins for EvaluationRow, EvaluateResult, and Message.
  • Normalizes whatever your evaluation returns (e.g., EvaluateResult, EvaluationRow with .evaluation_result, or a bare number) into a single float score.
You can inspect the full implementation in eval_protocol/integrations/openai_rft.py; a simplified sketch of the generated module is shown below.
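The generated module looks roughly like the following schematic sketch. It is simplified (it is based on the full generated source shown in the validate response later on this page), and the toy exact-match body of _ep_eval stands in for wherever your inlined evaluation logic ends up:
# Simplified sketch of the module emitted by build_python_grader_from_evaluation_test.
# See the validate response later on this page for the full generated source.
from types import SimpleNamespace
from typing import Any, Dict


class EvaluationRow(SimpleNamespace):
    """Duck-typed stand-in for eval_protocol's EvaluationRow."""


class EvaluateResult(SimpleNamespace):
    """Duck-typed stand-in for eval_protocol's EvaluateResult."""


class Message(SimpleNamespace):
    """Duck-typed stand-in for eval_protocol's Message (role/content)."""


def _ep_eval(row, **kwargs):
    """In the real generated module, your @evaluation_test body is inlined here.
    The toy exact-match body below is just a placeholder for this sketch."""
    last = row.messages[-1].content if row.messages else ""
    row.evaluation_result = EvaluateResult(score=float(last == row.ground_truth))
    return row


def grade(sample: Dict[str, Any], item: Dict[str, Any]) -> float:
    # Build a minimal EvaluationRow from the RFT inputs:
    #   item["reference_answer"] -> row.ground_truth
    #   item["messages"]         -> row.messages (if present)
    #   sample["output_text"]    -> appended as the last assistant message
    messages = [Message(**m) for m in item.get("messages") or []]
    messages.append(Message(role="assistant", content=sample.get("output_text")))
    row = EvaluationRow(ground_truth=item.get("reference_answer"), messages=messages)

    result = _ep_eval(row=row)

    # Normalize EvaluateResult / EvaluationRow / bare numbers into a float score.
    if hasattr(result, "evaluation_result"):
        return float(result.evaluation_result.score)
    if hasattr(result, "score"):
        return float(result.score)
    return float(result)
For example, calling grade({"output_text": "foo"}, {"reference_answer": "foo"}) on this sketch returns 1.0.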

Grader Constraints

When you convert an @evaluation_test into an OpenAI Python grader, the resulting grader must satisfy OpenAI’s runtime limits: it runs in a sandbox with no network access and only a fixed set of preinstalled packages (e.g., numpy, pandas, rapidfuzz). For more details, see the OpenAI graders documentation.

Basic Usage

1. Write an Eval Protocol @evaluation_test

In example_rapidfuzz.py (from the openai-rft-quickstart repo) we define a simple evaluation test that uses rapidfuzz to score how close a model’s answer is to the ground truth:
@evaluation_test(
    input_rows=[DEMO_ROWS],
    rollout_processor=NoOpRolloutProcessor(),
    aggregation_method="mean",
    mode="pointwise",
)
def rapidfuzz_eval(row: EvaluationRow, **kwargs: Any) -> EvaluationRow:
    """
    Example @evaluation_test that scores a row using rapidfuzz.WRatio and
    attaches an EvaluateResult.
    """
    from rapidfuzz import fuzz, utils

    # For EP evals, we compare the EvaluationRow's ground_truth to the last assistant message.
    reference = row.ground_truth

    assistant_msgs = [m for m in row.messages if m.role == "assistant"]
    last_assistant_content = assistant_msgs[-1].content if assistant_msgs else ""
    prediction = last_assistant_content if isinstance(last_assistant_content, str) else ""

    score = float(
        fuzz.WRatio(
            str(prediction),
            str(reference),
            processor=utils.default_process,
        )
        / 100.0
    )
    row.evaluation_result = EvaluateResult(score=score)
    return row
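DEMO_ROWS is the small in-file dataset defined in example_rapidfuzz.py; because NoOpRolloutProcessor is used, each row already carries the assistant response to be scored. Its exact contents live in the quickstart repo, but a row of that shape could hypothetically look like the sketch below (the eval_protocol.models import path is an assumption here; check the repo for the real definition):
# Hypothetical sketch of a DEMO_ROWS entry; see example_rapidfuzz.py in the
# quickstart repo for the real definition. The import path below is assumed.
from eval_protocol.models import EvaluationRow, Message

DEMO_ROWS = [
    EvaluationRow(
        messages=[
            Message(role="user", content="Finish the rhyme: fuzzy wuzzy..."),
            Message(role="assistant", content="fuzzy wuzzy was a bear"),
        ],
        ground_truth="fuzzy wuzzy had no hair",
    ),
]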

2. Convert to a Python grader and call /graders/*

In test_openai_grader.py (also in the openai-rft-quickstart repo) we show how to:
  • Build a Python grader spec from rapidfuzz_eval
  • Validate it via /fine_tuning/alpha/graders/validate
  • Run it once via /fine_tuning/alpha/graders/run
import os
import requests

from eval_protocol.integrations.openai_rft import build_python_grader_from_evaluation_test
from examples.openai_rft.example_rapidfuzz import rapidfuzz_eval


api_key = os.environ["OPENAI_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

grader = build_python_grader_from_evaluation_test(rapidfuzz_eval)  # {"type": "python", "source": "..."}

# validate the grader
resp = requests.post(
    "https://api.openai.com/v1/fine_tuning/alpha/graders/validate",
    json={"grader": grader},
    headers=headers,
)
print("validate response:", resp.text)

# run the grader once with a dummy item/sample
payload = {
    "grader": grader,
    "item": {"reference_answer": "fuzzy wuzzy had no hair"},
    "model_sample": "fuzzy wuzzy was a bear",
}
resp = requests.post(
    "https://api.openai.com/v1/fine_tuning/alpha/graders/run",
    json=payload,
    headers=headers,
)
print("run response:", resp.text)

End-to-End Example

To see an end-to-end example that takes an @evaluation_test (rapidfuzz_eval), converts it into a {"type": "python", "source": ...} grader spec with build_python_grader_from_evaluation_test, and validates/runs it against the OpenAI /graders/* HTTP APIs, clone the quickstart repo and run:
git clone git@github.com:eval-protocol/openai-rft-quickstart.git
cd openai-rft-quickstart

pytest example_rapidfuzz.py -vs  # Shows that this works as an EP evaluation_test

python test_openai_grader.py     # Validates and runs the Python grader via OpenAI's /graders/* APIs
You can expect an output like:
validate response: {
  "grader": {
    "type": "python",
    "source": "def _ep_eval(row, **kwargs):\n    \"\"\"\n    Example @evaluation_test that scores a row using rapidfuzz.WRatio and\n    attaches an EvaluateResult.\n    \"\"\"\n    reference = row.ground_truth\n    assistant_msgs = [m for m in row.messages if m.role == 'assistant']\n    last_assistant_content = assistant_msgs[-1].content if assistant_msgs else ''\n    prediction = last_assistant_content if isinstance(last_assistant_content, str) else ''\n    from rapidfuzz import fuzz, utils\n    score = float(fuzz.WRatio(str(prediction), str(reference), processor=utils.default_process) / 100.0)\n    row.evaluation_result = EvaluateResult(score=score)\n    return row\n\n\nfrom typing import Any, Dict\nfrom types import SimpleNamespace\n\n\nclass EvaluationRow(SimpleNamespace):\n    \"\"\"Minimal duck-typed stand-in for an evaluation row.\n\n    Extend this with whatever attributes your eval logic uses.\n    \"\"\"\n    pass\n\n\nclass EvaluateResult(SimpleNamespace):\n    \"\"\"Simple stand-in for Eval Protocol's EvaluateResult.\n\n    This lets evaluation-style functions that construct EvaluateResult(score=...)\n    run inside the Python grader sandbox without importing eval_protocol.\n    \"\"\"\n\n    def __init__(self, score: float, **kwargs: Any) -> None:\n        super().__init__(score=score, **kwargs)\n\n\nclass Message(SimpleNamespace):\n    \"\"\"Duck-typed stand-in for eval_protocol.models.Message (role/content).\"\"\"\n    pass\n\n\ndef _build_row(sample: Dict[str, Any], item: Dict[str, Any]) -> EvaluationRow:\n    # Start from any item-provided messages (EP-style), defaulting to [].\n    raw_messages = item.get(\"messages\") or []\n    normalized_messages = []\n    for m in raw_messages:\n        if isinstance(m, dict):\n            normalized_messages.append(\n                Message(\n                    role=m.get(\"role\"),\n                    content=m.get(\"content\"),\n                )\n            )\n        else:\n            # Already Message-like; rely on duck typing (must have role/content)\n            normalized_messages.append(m)\n\n    reference = item.get(\"reference_answer\")\n    prediction = sample.get(\"output_text\")\n\n    # EP-style: ensure the model prediction is present as the last assistant message\n    if prediction is not None:\n        normalized_messages = list(normalized_messages)  # shallow copy\n        normalized_messages.append(Message(role=\"assistant\", content=prediction))\n\n    return EvaluationRow(\n        ground_truth=reference,\n        messages=normalized_messages,\n        item=item,\n        sample=sample,\n    )\n\n\ndef grade(sample: Dict[str, Any], item: Dict[str, Any]) -> float:\n    row = _build_row(sample, item)\n    result = _ep_eval(row=row)\n\n    # Try to normalize different result shapes into a float score\n    try:\n        from collections.abc import Mapping\n\n        if isinstance(result, (int, float)):\n            return float(result)\n\n        # EvaluateResult-like object with .score\n        if hasattr(result, \"score\"):\n            return float(result.score)\n\n        # EvaluationRow-like object with .evaluation_result.score\n        eval_res = getattr(result, \"evaluation_result\", None)\n        if eval_res is not None:\n            if isinstance(eval_res, Mapping):\n                if \"score\" in eval_res:\n                    return float(eval_res[\"score\"])\n            elif hasattr(eval_res, \"score\"):\n                return float(eval_res.score)\n\n        # Dict-like with score\n        if 
isinstance(result, Mapping) and \"score\" in result:\n            return float(result[\"score\"])\n    except Exception:\n        pass\n\n    return 0.0\n",
    "name": "grader-R5FhpA6BFQlo"
  }
}
run response: {
  "reward": 0.7555555555555555,
  "metadata": {
    "name": "grader-5XXSBZ9B1OJj",
    "type": "python",
    "errors": {
      "formula_parse_error": false,
      "sample_parse_error": false,
      "sample_parse_error_details": null,
      "truncated_observation_error": false,
      "unresponsive_reward_error": false,
      "invalid_variable_error": false,
      "invalid_variable_error_details": null,
      "other_error": false,
      "python_grader_server_error": false,
      "python_grader_server_error_type": null,
      "python_grader_runtime_error": false,
      "python_grader_runtime_error_details": null,
      "model_grader_server_error": false,
      "model_grader_refusal_error": false,
      "model_grader_refusal_error_details": null,
      "model_grader_parse_error": false,
      "model_grader_parse_error_details": null,
      "model_grader_exceeded_max_tokens_error": false,
      "model_grader_server_error_details": null,
      "endpoint_grader_internal_error": false,
      "endpoint_grader_internal_error_details": null,
      "endpoint_grader_server_error": false,
      "endpoint_grader_server_error_details": null,
      "endpoint_grader_safety_check_error": false
    },
    "execution_time": 6.831332206726074,
    "scores": {},
    "token_usage": null,
    "sampled_model_name": null
  },
  "sub_rewards": {},
  "model_grader_token_usage_per_model": {}
}
This confirms that:
  • Your Eval Protocol @evaluation_test (rapidfuzz_eval) runs as a normal eval via pytest.
  • The same function can be converted into a {"type": "python", "source": ...} grader spec and validated and run through the OpenAI RFT graders API.
Now that you have your grader, see OpenAI’s docs on preparing your dataset and creating a reinforcement fine-tuning job.
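For instance, the grader spec plugs into the reinforcement method config when you create a fine-tuning job. The sketch below reuses grader and headers from the snippet above; the model name and training file ID are placeholders, and the exact job schema (hyperparameters, response format, supported models) is documented in OpenAI’s reinforcement fine-tuning guide:
# Hedged sketch: create an RFT job that uses the Python grader built above.
# "file-abc123" and the model name are placeholders; see OpenAI's RFT docs
# for supported models and the full job schema.
job_payload = {
    "model": "o4-mini-2025-04-16",   # example RFT-capable reasoning model
    "training_file": "file-abc123",  # placeholder: your uploaded JSONL dataset
    "method": {
        "type": "reinforcement",
        "reinforcement": {
            "grader": grader,        # the {"type": "python", "source": ...} spec
        },
    },
}
resp = requests.post(
    "https://api.openai.com/v1/fine_tuning/jobs",
    json=job_payload,
    headers=headers,
)
print("create job response:", resp.text)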