> ## Documentation Index
> Fetch the complete documentation index at: https://evalprotocol.io/llms.txt
> Use this file to discover all available pages before exploring further.

# OpenAI RFT Trainer

> Reuse Eval Protocol evaluation tests as Python graders for OpenAI Reinforcement Fine-Tuning (RFT)

The OpenAI RFT adapter lets you **reuse Eval Protocol evaluation tests as Python graders** for OpenAI Reinforcement Fine-Tuning (RFT). Because your grading logic lives in an Eval Protocol `@evaluation_test`, you can reuse the exact same code as an OpenAI Python grader—making it easy to start with OpenAI RFT and later move to other Eval-Protocol supported training workflows (or vice versa) without rewriting your evals.

For a minimal working example, clone the [`openai-rft-quickstart`](https://github.com/eval-protocol/openai-rft-quickstart) repository, which contains the `example_rapidfuzz.py` and `test_openai_grader.py` files used in the examples below.

## High Level Overview

The core helper function lives in:

```python theme={null}
from eval_protocol.integrations.openai_rft import build_python_grader_from_evaluation_test
```

Under the hood, `build_python_grader_from_evaluation_test`:

* **Takes** your Eval Protocol `@evaluation_test` function that operates on an `EvaluationRow`.
* **Wraps** it into a self-contained `{"type": "python", "source": ...}` grader module with a `grade(sample, item)` entrypoint.
* **Builds** a minimal `EvaluationRow` from the OpenAI RFT inputs by:
  * Mapping `item["reference_answer"]` to `row.ground_truth`
  * Mapping `item["messages"]` (if present) to `row.messages`
  * Mapping `sample["output_text"]` to the last assistant message
* **Removes** any runtime dependency on `eval-protocol` inside the grader by using simple duck-typed stand-ins for `EvaluationRow`, `EvaluateResult`, and `Message`.
* **Normalizes** whatever your evaluation returns (e.g., `EvaluateResult`, `EvaluationRow` with `.evaluation_result`, or a bare number) into a single float score.

You can inspect the full implementation in [`eval_protocol/integrations/openai_rft.py`](https://github.com/eval-protocol/python-sdk/blob/main/eval_protocol/integrations/openai_rft.py).

## Grader Constraints

When you convert an `@evaluation_test` into an OpenAI Python grader, it must satisfy OpenAI’s runtime limits, i.e. no network access, fixed set of packages (e.g., `numpy`, `pandas`, `rapidfuzz`, etc.). For more details, see [OpenAI graders documentation](https://platform.openai.com/docs/guides/graders#technical-constraints).

## Basic Usage

### 1. Write an Eval Protocol `@evaluation_test`

In `example_rapidfuzz.py` (from the `openai-rft-quickstart` repo) we define a simple evaluation test that uses `rapidfuzz` to score how close a model’s answer is to the ground truth:

```python theme={null}
@evaluation_test(
    input_rows=[DEMO_ROWS],
    rollout_processor=NoOpRolloutProcessor(),
    aggregation_method="mean",
    mode="pointwise",
)
def rapidfuzz_eval(row: EvaluationRow, **kwargs: Any) -> EvaluationRow:
    """
    Example @evaluation_test that scores a row using rapidfuzz.WRatio and
    attaches an EvaluateResult.
    """
    from rapidfuzz import fuzz, utils

    # For EP evals, we compare the EvaluationRow's ground_truth to the last assistant message.
    reference = row.ground_truth

    assistant_msgs = [m for m in row.messages if m.role == "assistant"]
    last_assistant_content = assistant_msgs[-1].content if assistant_msgs else ""
    prediction = last_assistant_content if isinstance(last_assistant_content, str) else ""

    score = float(
        fuzz.WRatio(
            str(prediction),
            str(reference),
            processor=utils.default_process,
        )
        / 100.0
    )
    row.evaluation_result = EvaluateResult(score=score)
    return row
```

### 2. Convert to a Python grader and call `/graders/*`

In `test_openai_grader.py` (also in the `openai-rft-quickstart` repo) we show how to:

* Build a Python grader spec from `rapidfuzz_eval`
* Validate it via `/fine_tuning/alpha/graders/validate`
* Run it once via `/fine_tuning/alpha/graders/run`

```python theme={null}
import os
import requests

from eval_protocol.integrations.openai_rft import build_python_grader_from_evaluation_test
from examples.openai_rft.example_rapidfuzz import rapidfuzz_eval


api_key = os.environ["OPENAI_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

grader = build_python_grader_from_evaluation_test(rapidfuzz_eval)  # {"type": "python", "source": "..."}

# validate the grader
resp = requests.post(
    "https://api.openai.com/v1/fine_tuning/alpha/graders/validate",
    json={"grader": grader},
    headers=headers,
)
print("validate response:", resp.text)

# run the grader once with a dummy item/sample
payload = {
    "grader": grader,
    "item": {"reference_answer": "fuzzy wuzzy had no hair"},
    "model_sample": "fuzzy wuzzy was a bear",
}
resp = requests.post(
    "https://api.openai.com/v1/fine_tuning/alpha/graders/run",
    json=payload,
    headers=headers,
)
print("run response:", resp.text)
```

## End-to-End Example

To see an end-to-end example that takes an `@evaluation_test` (`rapidfuzz_eval`), converts it into a `{"type": "python", "source": ...}` grader spec with `build_python_grader_from_evaluation_test`, and validates/runs it against the OpenAI `/graders/*` HTTP APIs, clone the quickstart repo and run:

```bash theme={null}
git clone git@github.com:eval-protocol/openai-rft-quickstart.git
cd openai-rft-quickstart

pytest example_rapidfuzz.py -vs  # Shows that this works as an EP evaluation_test

python test_openai_grader.py     # Validates and runs the Python grader via OpenAI's /graders/* APIs
```

You can expect an output like:

```bash theme={null}
validate response: {
  "grader": {
    "type": "python",
    "source": "def _ep_eval(row, **kwargs):\n    \"\"\"\n    Example @evaluation_test that scores a row using rapidfuzz.WRatio and\n    attaches an EvaluateResult.\n    \"\"\"\n    reference = row.ground_truth\n    assistant_msgs = [m for m in row.messages if m.role == 'assistant']\n    last_assistant_content = assistant_msgs[-1].content if assistant_msgs else ''\n    prediction = last_assistant_content if isinstance(last_assistant_content, str) else ''\n    from rapidfuzz import fuzz, utils\n    score = float(fuzz.WRatio(str(prediction), str(reference), processor=utils.default_process) / 100.0)\n    row.evaluation_result = EvaluateResult(score=score)\n    return row\n\n\nfrom typing import Any, Dict\nfrom types import SimpleNamespace\n\n\nclass EvaluationRow(SimpleNamespace):\n    \"\"\"Minimal duck-typed stand-in for an evaluation row.\n\n    Extend this with whatever attributes your eval logic uses.\n    \"\"\"\n    pass\n\n\nclass EvaluateResult(SimpleNamespace):\n    \"\"\"Simple stand-in for Eval Protocol's EvaluateResult.\n\n    This lets evaluation-style functions that construct EvaluateResult(score=...)\n    run inside the Python grader sandbox without importing eval_protocol.\n    \"\"\"\n\n    def __init__(self, score: float, **kwargs: Any) -> None:\n        super().__init__(score=score, **kwargs)\n\n\nclass Message(SimpleNamespace):\n    \"\"\"Duck-typed stand-in for eval_protocol.models.Message (role/content).\"\"\"\n    pass\n\n\ndef _build_row(sample: Dict[str, Any], item: Dict[str, Any]) -> EvaluationRow:\n    # Start from any item-provided messages (EP-style), defaulting to [].\n    raw_messages = item.get(\"messages\") or []\n    normalized_messages = []\n    for m in raw_messages:\n        if isinstance(m, dict):\n            normalized_messages.append(\n                Message(\n                    role=m.get(\"role\"),\n                    content=m.get(\"content\"),\n                )\n            )\n        else:\n            # Already Message-like; rely on duck typing (must have role/content)\n            normalized_messages.append(m)\n\n    reference = item.get(\"reference_answer\")\n    prediction = sample.get(\"output_text\")\n\n    # EP-style: ensure the model prediction is present as the last assistant message\n    if prediction is not None:\n        normalized_messages = list(normalized_messages)  # shallow copy\n        normalized_messages.append(Message(role=\"assistant\", content=prediction))\n\n    return EvaluationRow(\n        ground_truth=reference,\n        messages=normalized_messages,\n        item=item,\n        sample=sample,\n    )\n\n\ndef grade(sample: Dict[str, Any], item: Dict[str, Any]) -> float:\n    row = _build_row(sample, item)\n    result = _ep_eval(row=row)\n\n    # Try to normalize different result shapes into a float score\n    try:\n        from collections.abc import Mapping\n\n        if isinstance(result, (int, float)):\n            return float(result)\n\n        # EvaluateResult-like object with .score\n        if hasattr(result, \"score\"):\n            return float(result.score)\n\n        # EvaluationRow-like object with .evaluation_result.score\n        eval_res = getattr(result, \"evaluation_result\", None)\n        if eval_res is not None:\n            if isinstance(eval_res, Mapping):\n                if \"score\" in eval_res:\n                    return float(eval_res[\"score\"])\n            elif hasattr(eval_res, \"score\"):\n                return float(eval_res.score)\n\n        # Dict-like with score\n        if isinstance(result, Mapping) and \"score\" in result:\n            return float(result[\"score\"])\n    except Exception:\n        pass\n\n    return 0.0\n",
    "name": "grader-R5FhpA6BFQlo"
  }
}
run response: {
  "reward": 0.7555555555555555,
  "metadata": {
    "name": "grader-5XXSBZ9B1OJj",
    "type": "python",
    "errors": {
      "formula_parse_error": false,
      "sample_parse_error": false,
      "sample_parse_error_details": null,
      "truncated_observation_error": false,
      "unresponsive_reward_error": false,
      "invalid_variable_error": false,
      "invalid_variable_error_details": null,
      "other_error": false,
      "python_grader_server_error": false,
      "python_grader_server_error_type": null,
      "python_grader_runtime_error": false,
      "python_grader_runtime_error_details": null,
      "model_grader_server_error": false,
      "model_grader_refusal_error": false,
      "model_grader_refusal_error_details": null,
      "model_grader_parse_error": false,
      "model_grader_parse_error_details": null,
      "model_grader_exceeded_max_tokens_error": false,
      "model_grader_server_error_details": null,
      "endpoint_grader_internal_error": false,
      "endpoint_grader_internal_error_details": null,
      "endpoint_grader_server_error": false,
      "endpoint_grader_server_error_details": null,
      "endpoint_grader_safety_check_error": false
    },
    "execution_time": 6.831332206726074,
    "scores": {},
    "token_usage": null,
    "sampled_model_name": null
  },
  "sub_rewards": {},
  "model_grader_token_usage_per_model": {}
}
```

This confirms that:

* Your Eval Protocol `@evaluation_test` (`rapidfuzz_eval`) runs as a normal eval via `pytest`.
* The same function can be converted into a `type: "python"` grader spec and validated / run through the OpenAI RFT graders API.

Now that you have your grader, see OpenAI’s docs on [preparing your dataset and creating a reinforcement fine-tuning job](https://platform.openai.com/docs/guides/reinforcement-fine-tuning#prepare-your-dataset).
