
# How to fine-tune your custom model

> Fine-tune a base model on your own data using Nebius Token Factory. This guide walks you through creating a job, monitoring it, and retrieving checkpoints via API and Python.

***

This guide shows you how to:

* Upload training (and optional validation) datasets
* Create a fine-tuning job via Python or cURL
* Monitor job status and events
* Download the resulting model checkpoints
* Understand the fine-tuning job API shape

***

## Prerequisites

1. Select a supported [**base model**](https://docs.tokenfactory.nebius.com/post-training/models) for **fine-tuning**.
2. [Create a dataset](https://docs.tokenfactory.nebius.com/post-training/datasets) for training.\
   Optionally, create a **validation** dataset as well.

   A typical split is:

   * **80–90%** of examples → training
   * **10–20%** of examples → validation
3. [Create an API key](https://docs.tokenfactory.nebius.com/api-reference/introduction#authentication).
4. Export the API key as an environment variable:

   ```bash theme={null}
   export NEBIUS_API_KEY=<YOUR_API_KEY>
   ```
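
As a rough illustration of the split suggested in step 2, the sketch below shuffles the lines of a JSONL dataset and separates them into training and validation examples. The function name `split_examples` and the 10% default are illustrative assumptions, not part of the Nebius tooling.

```python
import random


def split_examples(lines, validation_fraction=0.1, seed=42):
    """Shuffle examples deterministically and return (training, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    # Drop blank lines and trailing newlines so every example is one clean JSON string
    examples = [line.rstrip("\n") for line in lines if line.strip()]
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")
    random.Random(seed).shuffle(examples)
    n_validation = max(1, round(len(examples) * validation_fraction))
    return examples[n_validation:], examples[:n_validation]
```

To use it, read your source file with `readlines()`, pass the result to `split_examples`, and write each returned list to its own `.jsonl` file with one example per line.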

***

## How to fine-tune a model

<Tabs>
  <Tab title="Python">
    ### 1. Install and import the client

    1. Install the `openai` Python SDK (Nebius exposes an OpenAI-compatible API):

       ```bash theme={null}
       pip3 install --upgrade openai
       ```
    2. Import libraries:

       ```python theme={null}
       import os
       import time
       from openai import OpenAI
       ```
    3. Initialize the Nebius client:

       ```python theme={null}
       client = OpenAI(
           base_url="https://api.tokenfactory.nebius.com/v1/",
           api_key=os.environ["NEBIUS_API_KEY"],
       )
       ```

    ***

    ### 2. Upload training (and optional validation) datasets

    If you already uploaded datasets via the UI or API, you can skip this and reuse their IDs.

    ```python theme={null}
    # Upload a training dataset (the context manager closes the file afterwards)
    with open("training.jsonl", "rb") as f:
        training_file = client.files.create(file=f, purpose="fine-tune")

    print("Training file ID:", training_file.id)

    # Optional: upload a validation dataset
    with open("validation.jsonl", "rb") as f:
        validation_file = client.files.create(file=f, purpose="fine-tune")

    print("Validation file ID:", validation_file.id)
    ```

    You only need the `id` fields from these responses to create a fine-tuning job.

    ***

    ### 3. Configure fine-tuning parameters

    For a full list of allowed fields and defaults, see [API specification for a fine-tuning job](#api-specification-for-a-fine-tuning-job).

    ```python theme={null}
    job_request = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "training_file": training_file.id,
        # Optional: only include this if you actually uploaded a validation file
        "validation_file": validation_file.id,
        "suffix": "my-domain-adapter",  # Optional, helps you identify this run
        "hyperparameters": {
            "batch_size": 8,
            "learning_rate": 1e-5,
            "n_epochs": 3,
            "warmup_ratio": 0.0,
            "weight_decay": 0.0,
            "lora": True,
            "lora_r": 16,
            "lora_alpha": 16,
            "lora_dropout": 0.05,
            "packing": True,
            "max_grad_norm": 1.0,
            # context_length is measured in tokens (default: 8192)
            "context_length": 8192,
        },
        "seed": 42,
        "integrations": [
            {
                "type": "wandb",
                "wandb": {
                    "project": "my-finetunes",
                    "name": "llama-8b-customer-support",
                    "entity": "my-team",
                    "tags": ["finetune", "llama-3.1", "support-bot"],
                },
            },
            {
                "type": "hf",
                "hf": {
                    "output_repo_name": "<repo>",  # e.g. "org/llama-8b-support-ft"
                    "api_token": "<token-value>",  # HF PAT with write access
                },
            },
        ],
    }
    ```

    ***

    ### 4. Create and run the fine-tuning job

    ```python theme={null}
    job = client.fine_tuning.jobs.create(**job_request)
    print("Created job:", job.id, "status:", job.status)
    ```

    ***

    ### 5. Poll job status

    Fine-tuning takes time. Poll the job until it reaches a terminal status:

    ```python theme={null}
    TERMINAL_STATUSES = ["succeeded", "failed", "cancelled"]
    POLL_INTERVAL_SECONDS = 15

    # Keep polling until the job reaches a terminal status
    while job.status not in TERMINAL_STATUSES:
        time.sleep(POLL_INTERVAL_SECONDS)
        job = client.fine_tuning.jobs.retrieve(job.id)
        print("Current status:", job.status)

    print("Final status:", job.status)
    print("Job ID:", job.id)

    if job.status == "failed":
        print("Job failed with error:", job.error)
    ```

    * If `job.status == "succeeded"`, training finished successfully.
    * If `job.status == "failed"`, inspect `job.error` for `code`, `message`, and `param`. For transient `5xx` errors, you can safely retry.
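
    If you poll from a script or a CI pipeline, you may want an upper bound on the wait. The following is a minimal sketch of such a helper; the name `wait_for_job` and the six-hour default timeout are assumptions made for illustration, so pick limits that suit your jobs.

    ```python
    import time

    TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


    def wait_for_job(retrieve, job_id, poll_interval=15.0, timeout=6 * 60 * 60, sleep=time.sleep):
        """Poll `retrieve(job_id)` until the job reaches a terminal status or the timeout expires.

        `retrieve` is any callable that returns an object with a `.status` attribute.
        """
        deadline = time.monotonic() + timeout
        while True:
            job = retrieve(job_id)
            if job.status in TERMINAL_STATUSES:
                return job
            if time.monotonic() >= deadline:
                raise TimeoutError(f"Job {job_id} is still {job.status!r} after {timeout} seconds")
            sleep(poll_interval)
    ```

    With the client from step 1, a call could look like `job = wait_for_job(client.fine_tuning.jobs.retrieve, job.id)`.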

    ***

    ### 6. Inspect job events (optional but recommended)

    Events help you understand the lifecycle (file validation, dataset processing, training progress).

    ```python theme={null}
    if job.status == "succeeded":
        events = client.fine_tuning.jobs.list_events(job.id)
        for event in events.data:
            print(event.created_at, event.level, "-", event.message)
    ```

    You can consider training finished when you see messages like:

    * `Dataset processed successfully`
    * `Training completed successfully`
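
    The events endpoint accepts `limit` and `after` query parameters and reports `has_more` in its response, so a long job may need more than one request to return every event. The sketch below follows that cursor. It assumes that `after` takes the ID of the last event on the previous page, and `fetch_page` is a hypothetical wrapper that you write around the events endpoint and that returns the parsed JSON response.

    ```python
    def iter_all_events(fetch_page, limit=50):
        """Yield every event of a job, requesting further pages while `has_more` is true.

        `fetch_page(limit, after)` must return a dict shaped like the events
        response: {"data": [{"id": ...}, ...], "has_more": bool}.
        """
        after = None
        while True:
            page = fetch_page(limit=limit, after=after)
            events = page["data"]
            yield from events
            if not page.get("has_more") or not events:
                return
            # Assumption: the cursor is the ID of the last event on the current page
            after = events[-1]["id"]
    ```

    Because the function only depends on `fetch_page`, you can back it with any HTTP client.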

    ***

    ### 7. Download checkpoints and model files

    Each **checkpoint** represents the model after a certain number of training steps (often per epoch).

    ```python theme={null}
    if job.status == "succeeded":
        checkpoints = client.fine_tuning.jobs.checkpoints.list(job.id).data

        for checkpoint in checkpoints:
            print("Checkpoint:", checkpoint.id, "step:", checkpoint.step_number)
            os.makedirs(checkpoint.id, exist_ok=True)

            for file_id in checkpoint.result_files:
                # Get file metadata
                file_obj = client.files.retrieve(file_id)
                filename = file_obj.filename  # e.g. "<checkpoint_ID>/adapter_config.json"

                # Download file contents
                file_content = client.files.content(file_id)

                # Save to disk with the same filename
                output_path = os.path.join(checkpoint.id, os.path.basename(filename))
                file_content.write_to_file(output_path)
                print("Saved:", output_path)
    ```

    You now have:

    * Intermediate checkpoints (per step / epoch)
    * Final checkpoint (usually the last one in the list)

    Use the **latest checkpoint** for deployment unless you have a specific reason to pick an earlier one.
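
    One possible reason to pick an earlier checkpoint is a lower validation loss. The sketch below selects either the latest checkpoint or the one with the lowest `valid_loss`. It works on plain dicts shaped like the checkpoints API response; the helper name `pick_checkpoint` and the selection rule are illustrative assumptions, not a Nebius recommendation.

    ```python
    def pick_checkpoint(checkpoints, prefer_lowest_valid_loss=False):
        """Return the latest checkpoint, or the one with the lowest validation loss."""
        if not checkpoints:
            raise ValueError("no checkpoints to choose from")
        if prefer_lowest_valid_loss:
            # Only consider checkpoints that actually report a validation loss
            scored = [
                c for c in checkpoints
                if (c.get("metrics") or {}).get("valid_loss") is not None
            ]
            if scored:
                return min(scored, key=lambda c: c["metrics"]["valid_loss"])
        # Default: the checkpoint with the highest step number, i.e. the latest one
        return max(checkpoints, key=lambda c: c["step_number"])
    ```

    If no checkpoint reports `valid_loss` (for example, when the job had no validation file), the helper falls back to the latest checkpoint.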

    You can now [deploy your fine-tuned model](https://docs.tokenfactory.nebius.com/post-training/deploy-custom-model#how-to-deploy-a-model-fine-tuned-in-nebius-token-factory) and serve it via Nebius Token Factory.
  </Tab>

  <Tab title="cURL">
    ### 1. Upload datasets

    Upload the **training** dataset:

    ```bash theme={null}
    curl 'https://api.tokenfactory.nebius.com/v1/files' \
      -H 'Accept: application/json' \
      -H 'Content-Type: multipart/form-data' \
      -H "Authorization: Bearer $NEBIUS_API_KEY" \
      -F 'file=@training.jsonl' \
      -F 'purpose=fine-tune'
    ```

    <Accordion title="Response example">
      ```json theme={null}
      {
        "id": "<file_ID>",
        "bytes": 700867,
        "created_at": 1738235422,
        "filename": "training.jsonl",
        "object": "file",
        "purpose": "fine-tune"
      }
      ```
    </Accordion>

    Save the `id` as your `training_file`.

    Optionally, upload a **validation** dataset in the same way and save its `id` as `validation_file`.

    ***

    ### 2. Create a fine-tuning job

    ```bash theme={null}
    curl 'https://api.tokenfactory.nebius.com/v1/fine_tuning/jobs' \
      -X POST \
      -H 'Accept: application/json' \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $NEBIUS_API_KEY" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "suffix": "my-domain-adapter",
        "training_file": "<training_file_ID>",
        "validation_file": "<validation_file_ID>",
        "hyperparameters": {
          "batch_size": 8,
          "learning_rate": 0.00001,
          "n_epochs": 3,
          "warmup_ratio": 0.0,
          "weight_decay": 0.0,
          "lora": true,
          "lora_r": 16,
          "lora_alpha": 16,
          "lora_dropout": 0.05,
          "packing": true,
          "max_grad_norm": 1.0,
          "context_length": 8192
        },
        "seed": 42,
        "integrations": [
          {
            "type": "wandb",
            "wandb": {
              "project": "my-finetunes",
              "name": "llama-8b-customer-support",
              "entity": "my-team",
              "tags": ["finetune", "llama-3.1", "support-bot"]
            }
          }
        ]
      }'
    ```

    Once the job succeeds, you can [deploy the produced model files](https://docs.tokenfactory.nebius.com/fine-tuning/deploy-custom-model#how-to-deploy-a-model-fine-tuned-in-nebius-token-factory).

    ***

    ### 3. Check job status

    ```bash theme={null}
    curl "https://api.tokenfactory.nebius.com/v1/fine_tuning/jobs/<job_ID>" \
      -X GET \
      -H "Accept: application/json" \
      -H "Authorization: Bearer $NEBIUS_API_KEY"
    ```

    <Accordion title="200 OK response example">
      ```json theme={null}
      {
        "id": "<job_ID>",
        "created_at": 1738250578,
        "error": null,
        "finished_at": null,
        "hyperparameters": {
          "batch_size": 8,
          "learning_rate": 0.00001,
          "n_epochs": 3,
          "warmup_ratio": 0,
          "weight_decay": 0,
          "lora": true,
          "lora_r": 16,
          "lora_alpha": 16,
          "lora_dropout": 0.05,
          "packing": true,
          "max_grad_norm": 1,
          "context_length": 8192
        },
        "model": "<model_name>",
        "object": "fine_tuning.job",
        "organization_id": "",
        "result_files": [],
        "seed": 0,
        "status": "running",
        "trained_tokens": 0,
        "training_file": "<file_ID>",
        "validation_file": "<file_ID>",
        "estimated_finish": null,
        "suffix": "",
        "trained_steps": 0,
        "total_steps": 100
      }
      ```
    </Accordion>

    * Poll this endpoint every **≥ 15 seconds**.
    * Terminal statuses: `succeeded`, `failed`, `cancelled`.
    * If `status` is `failed`, inspect the `error` object. For transient server errors (`5xx`), you can recreate the job.

    ***

    ### 4. Inspect job events

    ```bash theme={null}
    curl "https://api.tokenfactory.nebius.com/v1/fine_tuning/jobs/<job_ID>/events" \
      -X GET \
      -H "Accept: application/json" \
      -H "Authorization: Bearer $NEBIUS_API_KEY" \
      --url-query limit=50
    ```

    Supported query parameters:

    * `limit` (integer, optional): Max number of events to return.
    * `after` (string, optional): Event ID to paginate after.

    You can consider training completed when you see messages like:

    * `Dataset '<file_ID>' processed successfully`
    * `Training completed successfully`

    <Accordion title="200 OK response example">
      ```json theme={null}
      {
        "data": [
          {
            "object": "fine_tuning.job.event",
            "id": "<event_ID>",
            "created_at": 1738250578,
            "level": "info",
            "message": "Job is submitted",
            "source": "api",
            "job_uuid": "<job_ID>"
          },
          {
            "object": "fine_tuning.job.event",
            "id": "<event_ID>",
            "created_at": 1738250609,
            "level": "info",
            "message": "Dataset '<file_ID>' processed successfully",
            "source": "datasets",
            "job_uuid": "<job_ID>"
          }
        ],
        "has_more": false
      }
      ```
    </Accordion>

    ***

    ### 5. List checkpoints

    ```bash theme={null}
    curl "https://api.tokenfactory.nebius.com/v1/fine_tuning/jobs/<job_ID>/checkpoints" \
      -X GET \
      -H "Accept: application/json" \
      -H "Authorization: Bearer $NEBIUS_API_KEY"
    ```

    <Accordion title="200 OK response example">
      ```json theme={null}
      {
        "object": "list",
        "data": [
          {
            "id": "<checkpoint_ID>",
            "created_at": 1740501233,
            "fine_tuned_model_checkpoint": "ft:meta-llama/Llama-3.1-8B-Instruct-2025-02-25:org_placeholder::IDPlaceholder:ckpt-step-3",
            "fine_tuning_job_id": "<job_ID>",
            "metrics": {
              "train_loss": 2.01,
              "valid_loss": 2.32
            },
            "object": "fine_tuning.job.checkpoint",
            "step_number": 3,
            "result_files": [
              "<file_ID_1>",
              "<file_ID_2>",
              "<file_ID_3>"
            ]
          }
        ],
        "first_id": "<first_checkpoint_ID>",
        "last_id": "<last_checkpoint_ID>",
        "has_more": false
      }
      ```
    </Accordion>

    The `result_files` array contains IDs of all files that belong to this checkpoint.

    ***

    ### 6. Inspect file metadata

    ```bash theme={null}
    curl "https://api.tokenfactory.nebius.com/v1/files/<file_ID>" \
      -X GET \
      -H "Accept: application/json" \
      -H "Authorization: Bearer $NEBIUS_API_KEY"
    ```

    <Accordion title="200 OK response example">
      ```json theme={null}
      {
        "id": "<file_ID>",
        "bytes": 907,
        "created_at": 1740501244,
        "filename": "<checkpoint_ID>/adapter_config.json",
        "object": "file",
        "purpose": "fine-tune"
      }
      ```
    </Accordion>

    Use the `filename` field to save the file with the correct path and extension.

    ***

    ### 7. Download file contents

    ```bash theme={null}
    curl "https://api.tokenfactory.nebius.com/v1/files/<file_ID>/content" \
      -X GET \
      -H "Accept: application/json" \
      -H "Authorization: Bearer $NEBIUS_API_KEY"
    ```

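    Save the response body locally under the name from the `filename` field you retrieved in the previous step, for example by adding `--output <filename>` to the command. Some checkpoint files may be binary, so write the response straight to disk rather than copying it from the terminal.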

    Once you download the required checkpoint files, you can [host the fine-tuned model](https://docs.tokenfactory.nebius.com/fine-tuning/deploy-custom-model#how-to-deploy-a-model-fine-tuned-in-nebius-token-factory) and serve it via Nebius Token Factory.
  </Tab>
</Tabs>

***

## API specification for a fine-tuning job

This section describes the **request payload** when creating a fine-tuning job.

```json theme={null}
{
  "model": "<string>",
  "suffix": "<string>",
  "training_file": "<file_ID>",
  "validation_file": "<file_ID>",
  "hyperparameters": {
    "batch_size": 8,
    "learning_rate": 0.00001,
    "n_epochs": 3,
    "warmup_ratio": 0,
    "weight_decay": 0,
    "lora": false,
    "lora_r": 8,
    "lora_alpha": 8,
    "lora_dropout": 0,
    "packing": true,
    "max_grad_norm": 1,
    "context_length": 8192
  },
  "seed": 42,
  "integrations": [
    {
      "type": "wandb",
      "wandb": {
        "project": "<string>",
        "name": "<string>",
        "entity": "<string>",
        "tags": ["<string>"]
      }
    },
    {
      "type": "hf",
      "hf": {
        "output_repo_name": "<string>", 
        "api_token": "<string>"
      }
    }
  ]
}
```

### Top-level fields

* `model` (string, **required**) Base [model](https://docs.tokenfactory.nebius.com/post-training/models) to fine-tune.
* `suffix` (string, optional) Human-readable suffix appended to the model name. Use this to distinguish multiple runs, e.g., `customer-support-v1`.
* `training_file` (string, **required**) ID of the file with the training dataset (`purpose = "fine-tune"`). See:
  * [How to create a dataset for fine-tuning](https://docs.tokenfactory.nebius.com/post-training/datasets)
  * [How to fine-tune a model](https://docs.tokenfactory.nebius.com/post-training/how-to-fine-tune)
* `validation_file` (string, optional) ID of the file with the validation dataset. Same format and requirements as the training dataset.
* `hyperparameters` (object, optional) Fine-tuning configuration. Omitted fields fall back to defaults.
* `seed` (integer, optional) Random seed used during training. Using the same `seed` and the same data/hyperparameters improves reproducibility between runs.
* `integrations` (array, optional) Third-party integrations configured for this job.
  * `type` (string, required) Integration type. Currently supported: `"wandb"`, `"hf"`.
  * `wandb` (object, required when `type = "wandb"`) Settings for exporting metrics to [Weights & Biases](https://wandb.ai/):
    * `project` (string, required): W\&B project name.
    * `name` (string, optional): Run name.
    * `entity` (string, optional): W\&B entity (user or team).
    * `tags` (array of strings, optional): Tags to attach to the run.
  * `hf` (object, required when `type = "hf"`) Settings for the [Hugging Face](https://huggingface.co/) integration:
    * `output_repo_name` (string, required): Target Hugging Face repo name, e.g. `"org/llama-8b-support-ft"` or `"username/my-finetune"`.
    * `api_token` (string, required): Hugging Face access token (PAT) with write access to `output_repo_name`.
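
Since only `model` and `training_file` are required, it can help to build the payload so that unset optional fields are left out entirely rather than sent as `null`. A minimal sketch, assuming a hypothetical helper named `build_job_request`:

```python
def build_job_request(model, training_file, validation_file=None, suffix=None,
                      seed=None, hyperparameters=None, integrations=None):
    """Assemble a fine-tuning payload, leaving out optional fields that are unset."""
    optional = {
        "validation_file": validation_file,
        "suffix": suffix,
        "seed": seed,
        "hyperparameters": hyperparameters,
        "integrations": integrations,
    }
    request = {"model": model, "training_file": training_file}
    # Only include optional fields that the caller actually provided
    request.update({key: value for key, value in optional.items() if value is not None})
    return request
```

The resulting dict can be serialized to JSON for the request body or unpacked into the SDK call with `**`.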

***

### Hyperparameters

All hyperparameters are nested under `hyperparameters`.
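
The numeric bounds documented in this section can be checked locally before you submit a job, which surfaces mistakes earlier than a rejected request. The sketch below is an assumption-laden convenience, not part of the API: the `BOUNDS` table is copied by hand from the list that follows, and the API remains the source of truth.

```python
# (minimum, maximum) per field; None means no documented bound on that side
BOUNDS = {
    "learning_rate": (0, None),
    "n_epochs": (1, 20),
    "warmup_ratio": (0, 1),
    "weight_decay": (0, None),
    "lora_r": (8, 128),
    "lora_alpha": (8, None),
    "lora_dropout": (0, 1),
    "max_grad_norm": (0, None),
}


def check_hyperparameters(hyperparameters):
    """Return a list of problems; an empty list means no documented bound is violated."""
    problems = []
    for name, value in hyperparameters.items():
        if name not in BOUNDS:
            continue  # booleans such as `lora`, or fields without documented bounds
        low, high = BOUNDS[name]
        if low is not None and value < low:
            problems.append(f"{name}={value} is below the minimum {low}")
        if high is not None and value > high:
            problems.append(f"{name}={value} is above the maximum {high}")
    return problems
```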

* `batch_size` (integer, optional) Number of examples per training batch. Larger batch sizes are more efficient but require more VRAM.
  * Typical range: `8`–`32`
  * Default: `8`
* `context_length` (integer, optional) Maximum sequence length in tokens used during fine-tuning. Inputs longer than this limit will cause errors.

  * Units: tokens (e.g., `8192`)
  * Supported values depend on the base model; see the [models](https://docs.tokenfactory.nebius.com/fine-tuning/models) page.
  * Default: `8192`

  Recommendations:

  * Analyze the token length distribution of your dataset.
  * Choose the smallest context length that covers your P95–P99 examples.
  * If `packing = false`, choosing a context length much larger than your examples leads to heavy padding and wasted compute.

  <Tip>
    Larger context lengths significantly increase VRAM usage and FLOPs due to attention scaling.
  </Tip>
* `learning_rate` (float, optional) Step size for gradient descent.
  * Must be `>= 0`
  * Typical values: `1e-6`–`5e-5`
  * Default: `0.00001`
* `n_epochs` (integer, optional) Number of passes over the entire dataset.

  * Range: `1`–`20`
  * Default: `3`

  More epochs increase task specialization but also overfitting risk.
* `warmup_ratio` (float, optional) Fraction of total training steps used for linear warmup of the learning rate from 0 to the target value.
  * Range: `0`–`1`
  * Default: `0`
* `weight_decay` (float, optional) L2 regularization factor applied to weights. Helps prevent overfitting and preserve generalization.
  * Must be `>= 0`
  * Default: `0`
* `lora` (boolean, optional) Whether to use **LoRA** (Low-Rank Adaptation) instead of full-parameter fine-tuning.
  * `true`: only LoRA adapter weights are trained; base model weights stay frozen.
  * `false`: full fine-tuning is applied.
  * Default: `false`
* `lora_r` (integer, optional) Rank of LoRA matrices. Higher values increase capacity but also overfitting and cost.
  * Range: `8`–`128`
  * Default: `8`
* `lora_alpha` (integer, optional) Scaling factor for LoRA updates. Higher values increase the impact of LoRA adapters.
  * Must be `>= 8`
  * Default: `8`
* `lora_dropout` (float, optional) Dropout applied to LoRA layers. Helps prevent overfitting, especially on small datasets.
  * Range: `0`–`1`
  * Default: `0`
* `packing` (boolean, optional) If `true`, multiple shorter samples can be packed into a single sequence to better utilize the context window and improve efficiency.
  * Default: `true`
* `max_grad_norm` (float, optional) Gradient clipping threshold (L2 norm). Avoids unstable updates:
  * Too **high** → effectively no clipping → risk of **exploding gradients**.
  * Too **low** → overly aggressive clipping → risk of **under-training**.
  * Must be `>= 0`
  * Default: `1`
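
The `context_length` recommendation above (cover your P95–P99 examples with the smallest value) can be sketched as follows. The sketch assumes you have already computed per-example token counts with the base model's tokenizer, and the candidate lengths in the usage note are placeholders; check the models page for the values your base model actually supports.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (0 < q <= 100) of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def pick_context_length(token_lengths, supported_lengths, q=99):
    """Return the smallest supported context length that covers the q-th percentile."""
    target = percentile(token_lengths, q)
    for candidate in sorted(supported_lengths):
        if candidate >= target:
            return candidate
    raise ValueError(f"No supported context length covers P{q} = {target} tokens")
```

For example, `pick_context_length(token_lengths, [2048, 4096, 8192], q=95)` returns the smallest of the three placeholder values that is at least as long as 95% of your examples.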

***

### Fine-tuning job object (response shape)

When you list jobs, the response has the shape shown below. Querying a single job returns one object with the same fields as an entry in `data`:

```json theme={null}
{
  "data": [
    {
      "id": "<string>",
      "created_at": 123,
      "hyperparameters": {
        "batch_size": 8,
        "learning_rate": 0.00001,
        "n_epochs": 3,
        "warmup_ratio": 0,
        "weight_decay": 0,
        "lora": false,
        "lora_r": 8,
        "lora_alpha": 8,
        "lora_dropout": 0,
        "packing": true,
        "max_grad_norm": 1,
        "context_length": 8192
      },
      "model": "<string>",
      "status": "validating_files",
      "training_file": "<string>",
      "error": {
        "code": "<string>",
        "message": "<string>",
        "param": "<string>"
      },
      "finished_at": 123,
      "integrations": [
        {
          "wandb": {
            "project": "<string>",
            "name": "<string>",
            "entity": "<string>",
            "tags": ["<string>"]
          },
          "type": "wandb"
        },
        {
          "hf": {
            "output_repo_name": "<string>",
            "api_token": "<string>"
          },
          "type": "hf"
        }
      ],
      "object": "fine_tuning.job",
      "organization_id": "",
      "result_files": [],
      "seed": 0,
      "suffix": "<string>",
      "trained_tokens": 123,
      "validation_file": "<string>",
      "estimated_finish": 123,
      "trained_steps": 123,
      "total_steps": 123
    }
  ],
  "has_more": true,
  "object": "list"
}
```

Key fields to watch during a run:

* `status`: `validating_files` → `queued` → `running` → `succeeded` / `failed`
* `trained_tokens`: how many tokens have been processed so far
* `trained_steps` / `total_steps`: progress of the training loop
* `error`: structured error info when `status = "failed"`
* `result_files`: IDs of produced artifacts (also available via checkpoints API)

Use these fields plus job events to drive your own monitoring, dashboards, or CI/CD automation around fine-tuning.
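
As a starting point for such monitoring, the sketch below turns a job object (as a plain dict) into a one-line progress summary. The helper name `summarize_progress` is hypothetical, and the code assumes that `estimated_finish` is a Unix timestamp in seconds like `created_at`, which this page does not state explicitly.

```python
from datetime import datetime, timezone


def summarize_progress(job):
    """Build a one-line progress summary from a fine-tuning job dict."""
    parts = [f"status={job['status']}"]
    total = job.get("total_steps") or 0
    if total > 0:
        done = job.get("trained_steps") or 0
        parts.append(f"steps={done}/{total} ({100 * done / total:.0f}%)")
    eta = job.get("estimated_finish")
    if eta:
        # Assumption: `estimated_finish` is a Unix timestamp in seconds
        parts.append("eta=" + datetime.fromtimestamp(eta, tz=timezone.utc).isoformat())
    error = job.get("error")
    if error:
        parts.append(f"error={error.get('message')}")
    return " ".join(parts)
```

You could log this line on every poll or push it to a dashboard.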
