
# GSM8K Fine-tuning Quickstart (Small Model)

> Run pytest to materialize the evaluator and dataset, then launch a local Reinforcement Fine-Tuning job on a small model.

Run GSM8K locally end-to-end:

* Materialize a GSM8K evaluator and dataset with `pytest`
* Kick off a Reinforcement Fine-Tuning (RFT) job for a small base model
* Track accuracy improvements by re-running the evaluator

<Note type="Tip" title="Run in Colab (Billing Required)">
  Running the GSM8K tutorial in Google Colab requires a Google account with billing enabled (credit card on file). Fireworks usage also bills against your account once you supply `FIREWORKS_API_KEY`. 👉 [Run the GSM8K Fine-tuning Colab](https://colab.research.google.com/drive/16xrb9rx6AoAEOtrDXumzo71HjhunaoPi#scrollTo=CP18QX4tgi-0)
</Note>

## Prerequisites

* Python 3.10+
* Local Python environment with Jupyter support (VS Code, JupyterLab, or classic notebook)
* `FIREWORKS_API_KEY` with permissions to launch RFT jobs (stored in your shell or `.env`)
* Basic familiarity with GSM8K-style math reasoning tasks

<Steps>
  <Step title="Install dependencies">
    Install the latest `eval-protocol` SDK directly from the main branch and make sure `pytest` is on the path. Upgrade `pip` first to avoid resolver issues.

    ```bash theme={null}
    python -m pip install --upgrade pip
    python -m pip install pytest git+https://github.com/eval-protocol/python-sdk.git
    ```
  </Step>

  <Step title="Download evaluation assets">
    Copy the GSM8K pytest script and sample dataset from the `eval-protocol/python-sdk` repository into a working directory (here `gsm8k_artifacts/`). These are the assets the later steps use to kick off the job. The snippet below runs in a notebook cell or as a standalone script and writes the files where later steps expect them. It uses the `requests` package, so install it first if your environment does not already have it.

    ```python tutorial/download_gsm8k_assets.py theme={null}
    from pathlib import Path
    import requests

    ARTIFACT_ROOT = Path("gsm8k_artifacts")
    TEST_PATH = ARTIFACT_ROOT / "tests" / "pytest" / "gsm8k" / "test_pytest_math_example.py"
    DATASET_PATH = ARTIFACT_ROOT / "development" / "gsm8k_sample.jsonl"

    files_to_download = {
        TEST_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/tests/pytest/gsm8k/test_pytest_math_example.py",
        DATASET_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/development/gsm8k_sample.jsonl",
    }

    for local_path, url in files_to_download.items():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        local_path.write_bytes(response.content)
        print(f"Saved {url} -> {local_path}")
    ```

    Expected output:

    ```
    Saved https://raw.githubusercontent.com/.../test_pytest_math_example.py -> gsm8k_artifacts/tests/pytest/gsm8k/test_pytest_math_example.py
    Saved https://raw.githubusercontent.com/.../gsm8k_sample.jsonl -> gsm8k_artifacts/development/gsm8k_sample.jsonl
    ```
  </Step>

  <Step title="Run the GSM8K evaluation locally">
    Run the evaluation from the artifacts folder you created in the previous step. This run materializes the evaluator and dataset.

    ```bash theme={null}
    cd gsm8k_artifacts
    ep local-test
    ```

    This command discovers and runs your `@evaluation_test` with pytest.

    You should see log output for each rollout. Navigate to [http://localhost:8000](http://localhost:8000) to open the Eval Protocol UI and inspect the results.

    <img src="https://mintcdn.com/fireworksai-staging/1UT0LIye1j-4_g9o/assets/gsm8k-local-eval.png?fit=max&auto=format&n=1UT0LIye1j-4_g9o&q=85&s=6068ca265d01b89fe7a2a9362cd8a6b5" alt="Screenshot of the local GSM8K evaluation UI showing aggregate scores and trajectories panels." width="1372" height="932" data-path="assets/gsm8k-local-eval.png" />
  </Step>

  <Step title="Set FIREWORKS_API_KEY">
    Store your Fireworks API key in the environment so the CLI can authenticate. The command below keeps the key confined to the current shell session.

    ```bash theme={null}
    export FIREWORKS_API_KEY="<your-fireworks-key>"
    ```

    Alternatively, load it from a secrets manager or `.env` file if your workflow already manages credentials securely.
  </Step>

  <Step title="Launch the RFT job">
    Trigger the `eval-protocol` CLI to start a Reinforcement Fine-Tuning job using the evaluator and dataset registered above. Replace the base model to experiment with other policies.

    ```bash theme={null}
    cd ..
    eval-protocol create rft --base-model accounts/fireworks/models/qwen3-0p6b
    ```

    The CLI reports dashboard links for the evaluator, dataset, and RFT job so you can monitor rollouts.

    <img src="https://mintcdn.com/fireworksai-staging/1UT0LIye1j-4_g9o/assets/gsm8k-rft-final.png?fit=max&auto=format&n=1UT0LIye1j-4_g9o&q=85&s=bad7e10ee350647b199d5f15d2b3f455" alt="Screenshot of the Fireworks RFT dashboard highlighting final accuracy metrics for the GSM8K job." width="1090" height="479" data-path="assets/gsm8k-rft-final.png" />
  </Step>
</Steps>

## Track accuracy over time

* Re-run the `ep local-test` command periodically to evaluate the latest checkpoint against the GSM8K slice.
* Adjust reward shaping or parsing logic inside `test_pytest_math_example.py` to fit your formatting expectations.
* Swap in a custom dataset JSONL by editing the local artifact or passing `--dataset-jsonl` when creating the RFT job.
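
Before swapping in a custom dataset, it can help to confirm that the JSONL file parses cleanly. The sketch below is an illustration, not part of the SDK: `summarize_jsonl` is a hypothetical helper that counts records and tallies top-level keys without assuming any particular field names.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Parse a JSONL file and return (record_count, Counter of top-level keys).

    Raises ValueError naming the first line that is not a valid JSON object.
    Hypothetical helper for illustration; not part of eval-protocol.
    """
    key_counts = Counter()
    records = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            if not isinstance(row, dict):
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            key_counts.update(row.keys())
            records += 1
    return records, key_counts


if __name__ == "__main__":
    # Path taken from the download step of this guide.
    dataset = Path("gsm8k_artifacts/development/gsm8k_sample.jsonl")
    if dataset.exists():
        count, keys = summarize_jsonl(dataset)
        print(f"{count} records; keys seen: {dict(keys)}")
```

Comparing the key tally of a custom file against the tally of `gsm8k_sample.jsonl` is a quick way to spot a schema mismatch before launching a job.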

## What’s happening under the hood

* The evaluation tests a small GSM8K slice with a numeric-check reward and registers an evaluator plus dataset with your local API.
* The `create rft` command wires those resources into a Reinforcement Fine-Tuning job for the specified base model.
* As training progresses, evaluation scores should reflect improving accuracy on the held-out set, letting you iterate quickly before scaling up.
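
To make the idea of a numeric-check reward concrete, here is a minimal sketch. It is an assumption about what such a check could look like, not the logic in `test_pytest_math_example.py`: the names `extract_final_number` and `numeric_reward` are hypothetical, and the rule "compare the last number in the response with the last number in the reference" is chosen only for illustration.

```python
import re
from typing import Optional

# Optionally signed integers or decimals, allowing thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number appearing in text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def numeric_reward(response: str, expected: str, tol: float = 1e-6) -> float:
    """Score 1.0 when the final numbers agree within tol, otherwise 0.0."""
    predicted = extract_final_number(response)
    target = extract_final_number(expected)
    if predicted is None or target is None:
        return 0.0
    return 1.0 if abs(predicted - target) <= tol else 0.0


if __name__ == "__main__":
    print(numeric_reward("16 - 3 - 4 = 9 eggs, so she earns 9 * 2 = 18", "#### 18"))
    print(numeric_reward("I think the answer is 17", "#### 18"))
```

A binary reward like this is easy to reason about, but it is sensitive to formatting: a response that ends with a unit price or a restated intermediate value would be scored on the wrong number, which is why the parsing rules are worth adjusting to your own output format.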

## Next steps

* Parameterize base models and dataset paths in scripts or notebooks to make repeated experiments easier.
* Automate the evaluation loop in CI so new policies are validated before deployment.
* Promote successful evaluators and datasets to shared registries once the workflow is stable.
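
As one way to parameterize repeated experiments, the sketch below wraps the `eval-protocol create rft` command from this guide in a small Python helper. The function names are hypothetical, and the only flags used are `--base-model` and `--dataset-jsonl`, both of which appear above. It defaults to a dry run that prints the command without executing it.

```python
import os
import shlex
import subprocess


def build_rft_command(base_model, dataset_jsonl=None):
    """Assemble the `eval-protocol create rft` argument list used in this guide."""
    command = ["eval-protocol", "create", "rft", "--base-model", base_model]
    if dataset_jsonl is not None:
        command += ["--dataset-jsonl", str(dataset_jsonl)]
    return command


def launch_rft(base_model, dataset_jsonl=None, dry_run=True):
    """Print the command; execute it only when dry_run is False and the key is set."""
    command = build_rft_command(base_model, dataset_jsonl)
    print(shlex.join(command))
    if dry_run:
        return None
    if not os.environ.get("FIREWORKS_API_KEY"):
        raise RuntimeError("FIREWORKS_API_KEY is not set in this environment")
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run: prints the command for the small model used in this quickstart.
    launch_rft("accounts/fireworks/models/qwen3-0p6b")
```

Passing the command as a list rather than a shell string avoids quoting problems when model names or dataset paths contain spaces.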

Ready to try your own? Clone the [GSM8K Quickstart Repository](https://github.com/eval-protocol/quickstart-gsm8k) for a runnable project with all files included.
