# custom-speculator

Use the Custom Speculator API to train a speculative decoding drafter for a supported base model in Token Factory. A custom speculator job produces a drafter artifact that can be served with vLLM for speculative decoding.

## Create a custom speculator job

Create a fine-tuning job with `method.type="spec-draft"` and provide speculator hyperparameters under `method.spec_draft.hyperparameters`.

```python theme={null}
import asyncio

from openai import AsyncOpenAI

# Placeholder: ID of a previously uploaded pretokenized JSONL training file.
training_file = "<training_file_id>"

job_request = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "training_file": training_file,
    "method": {
        "type": "spec-draft",
        "spec_draft": {
            "hyperparameters": {
                "batch_size": 8,
                "learning_rate": 0.0075,
                "n_epochs": 1,
                "warmup_ratio": 0.1,
                "max_grad_norm": 1.0,
                "context_length": 8192,
                "loss": {"type": "kl"},
                "num_decoding_heads": 3,
                "architecture": "eagle3_original",
            },
        },
    },
    "seed": 2214,
    "suffix": "specdec-example",
}


async def main() -> None:
    client = AsyncOpenAI(
        base_url="<base_url>",
        api_key="<token>",
    )
    # Submit the speculator training job and print its ID for later lookups.
    job = await client.fine_tuning.jobs.create(**job_request)
    print(job.id)


asyncio.run(main())
```

## Dataset format

Only pretokenized datasets are supported. Upload a JSONL file where each line contains:

* `input_ids` — array of token IDs.
* `attention_mask` — array of `1`s and `0`s aligned with `input_ids`.
* `labels` — array of target token IDs. Use `-100` to ignore a position in the loss.

```jsonl theme={null}
{"input_ids": [1, 14, 52, 16], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 52, 16]}
{"input_ids": [160, 34, 129, 10432], "attention_mask": [1, 1, 1, 1], "labels": [160, 34, 129, 10432]}
```
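As a rough illustration of the format above, the sketch below builds one record from already-tokenized prompt and completion IDs and serializes it as JSONL. The helper names (`build_record`, `validate_record`, `to_jsonl`) are hypothetical, and masking the prompt with `-100` is only one possible labeling choice. The check that `labels` has the same length as `input_ids` is an assumption based on the examples shown.

```python
import json

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def build_record(prompt_ids, completion_ids):
    """Build one pretokenized record, masking the prompt out of the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids),
    }


def validate_record(record):
    """Check that the three arrays are aligned (assumed requirement)."""
    n = len(record["input_ids"])
    if len(record["attention_mask"]) != n or len(record["labels"]) != n:
        raise ValueError("input_ids, attention_mask and labels must have equal length")
    if any(m not in (0, 1) for m in record["attention_mask"]):
        raise ValueError("attention_mask may only contain 0 and 1")


def to_jsonl(records):
    """Serialize records as JSONL text, one JSON object per line."""
    lines = []
    for record in records:
        validate_record(record)
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


example = build_record(prompt_ids=[1, 14], completion_ids=[52, 16])
jsonl_text = to_jsonl([example])
```

Here `example` reproduces the first sample line shown above. Write `jsonl_text` to a file before uploading it as the training file.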

## Request fields

### Top-level fields

| Field           | Type    | Required | Description                                                |
| --------------- | ------- | -------- | ---------------------------------------------------------- |
| `model`         | string  | Yes      | Base model to train the speculator against.                |
| `training_file` | string  | Yes      | Training dataset file ID.                                  |
| `method`        | object  | Yes      | Fine-tuning method configuration. Use `type="spec-draft"`. |
| `seed`          | integer | No       | Random seed for reproducibility.                           |
| `suffix`        | string  | No       | Custom suffix for the resulting job or artifact name.      |

### Method fields

| Field                               | Type   | Required | Description                          |
| ----------------------------------- | ------ | -------- | ------------------------------------ |
| `method.type`                       | string | Yes      | Must be `spec-draft`.                |
| `method.spec_draft.hyperparameters` | object | Yes      | Speculator training hyperparameters. |

### Hyperparameters

| Field                | Type    | Description                                                                                                             |
| -------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------- |
| `batch_size`         | integer | Training batch size. A large batch size (e.g. `32`) is recommended.                                                     |
| `learning_rate`      | number  | Optimizer learning rate. `0.0075` is recommended for models under 100B parameters; use a lower value for larger models. |
| `n_epochs`           | integer | Number of training epochs.                                                                                              |
| `warmup_ratio`       | number  | Fraction of total steps used for learning-rate warmup.                                                                  |
| `max_grad_norm`      | number  | Gradient clipping threshold.                                                                                            |
| `context_length`     | integer | Maximum sequence length used during training.                                                                           |
| `num_decoding_heads` | integer | Number of speculative decoding heads to train. Supported values: `1`–`7`.                                               |
| `architecture`       | string  | Drafter architecture. Supported values: `eagle3_original`, `eagle3`.                                                    |
| `loss.type`          | string  | Speculator training loss. Supported values: `kl`, `lk_alpha`, `lk_hybrid`.                                              |
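The documented value ranges can be checked client-side before submitting a job. The following is a minimal sketch, not part of the API: `check_hyperparameters` is a hypothetical helper, it only checks fields that are present, and it covers only the constraints listed in the table above.

```python
SUPPORTED_ARCHITECTURES = {"eagle3", "eagle3_original"}
SUPPORTED_LOSSES = {"kl", "lk_alpha", "lk_hybrid"}


def check_hyperparameters(hp):
    """Return a list of problems found in a hyperparameters dict."""
    problems = []
    if "num_decoding_heads" in hp:
        heads = hp["num_decoding_heads"]
        if not isinstance(heads, int) or not 1 <= heads <= 7:
            problems.append("num_decoding_heads must be an integer from 1 to 7")
    if "architecture" in hp and hp["architecture"] not in SUPPORTED_ARCHITECTURES:
        problems.append("architecture must be one of: eagle3, eagle3_original")
    loss_type = hp.get("loss", {}).get("type")
    if loss_type is not None and loss_type not in SUPPORTED_LOSSES:
        problems.append("loss.type must be one of: kl, lk_alpha, lk_hybrid")
    return problems


hyperparameters = {
    "batch_size": 8,
    "learning_rate": 0.0075,
    "n_epochs": 1,
    "loss": {"type": "kl"},
    "num_decoding_heads": 3,
    "architecture": "eagle3_original",
}
problems = check_hyperparameters(hyperparameters)
```

An empty `problems` list means none of the documented constraints were violated. The service may still apply further validation.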

## Supported architectures

| Architecture      | Description                                                                        |
| ----------------- | ---------------------------------------------------------------------------------- |
| `eagle3`          | Token Factory-optimized variant. Can only be served on the Token Factory platform. |
| `eagle3_original` | Reference implementation. Portable to any vLLM-supported platform.                 |

## Supported losses

Set the training loss with `loss.type`. Supported values:

* **`kl`** — Standard Kullback-Leibler divergence loss between drafter and target model logits.
* **`lk_alpha`** — LK loss with KL disabled.
* **`lk_hybrid`** — Weighted combination of KL and LK losses.
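For intuition about the `kl` option, the sketch below computes a KL divergence between the distributions implied by target and drafter logits at a single position. It illustrates the general formula only. The direction `KL(target || drafter)`, the lack of temperature scaling, and the function names are assumptions, not a description of how Token Factory computes the loss.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target_logits, draft_logits):
    """KL(target || drafter) over one vocabulary distribution (assumed direction)."""
    p = softmax(target_logits)
    q = softmax(draft_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


same = kl_divergence([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = kl_divergence([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```

The value is zero when the drafter reproduces the target distribution exactly and grows as the two distributions diverge.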

## Training metrics

Custom speculator jobs report two categories of metrics:

* **Loss metrics** for monitoring optimization and model behavior during training.
* **Sampling performance metrics** for estimating speculative decoding quality and efficiency.

<Note>
  Some per-head metrics are reported conditionally. For a given head, the metric is computed under the assumption that all previous speculative tokens were accepted. For example, metrics for head 4 are conditioned on tokens 1 through 3 being accepted.
</Note>

### Loss metrics

The following loss metrics are reported during training:

| Metric                     | Description                                                                                                            |
| -------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `lk` (per head)            | LK loss for each decoding head.                                                                                        |
| `kl` (per head)            | Kullback-Leibler divergence loss between drafter and target model logits.                                              |
| `cross_entropy` (per head) | Cross-entropy loss for each decoding head.                                                                             |
| `target_model_loss`        | Cross-entropy loss of the target model. Useful for understanding how the target model behaves on the provided dataset. |
| `draft_loss`               | Main speculative decoding training loss reported for the job.                                                          |

<Info>
  All of the losses above are reported regardless of which `loss.type` is selected.
</Info>

### Sampling performance metrics

Each decoding head reports the following sampling performance metrics:

| Metric              | Description                                                                 |
| ------------------- | --------------------------------------------------------------------------- |
| `acceptance_rate`   | Expected probability that the speculative token from that head is accepted. |
| `number_of_tokens`  | Expected accepted token length.                                             |
| `system_efficiency` | Speculative decoding efficiency, measured as `#tokens / #decoding_heads`.   |
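To see how conditional per-head acceptance rates can combine into an expected accepted length, consider the sketch below. It assumes that each head's rate is conditioned on all earlier heads being accepted, and that the accepted length counts only drafted tokens, not any extra token from the target model. The function names and example rates are hypothetical, and the platform's exact computation may differ.

```python
def expected_accepted_tokens(conditional_rates):
    """Sum, over heads, of the probability that every head up to it is accepted."""
    expected = 0.0
    survive = 1.0  # probability that all heads so far were accepted
    for rate in conditional_rates:
        survive *= rate
        expected += survive
    return expected


def system_efficiency(num_tokens, num_heads):
    """Efficiency as defined in the table: #tokens / #decoding_heads."""
    return num_tokens / num_heads


rates = [0.8, 0.5, 0.5]  # hypothetical conditional acceptance rates for 3 heads
tokens = expected_accepted_tokens(rates)  # 0.8 + 0.8*0.5 + 0.8*0.5*0.5
efficiency = system_efficiency(tokens, len(rates))
```

Because later heads only contribute when every earlier head is accepted, adding heads gives diminishing returns once the cumulative acceptance probability is small.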

### Metric variations

Sampling metrics may be reported for multiple evaluation variations:

<AccordionGroup>
  <Accordion title="default">
    Standard sampling with `temperature=1.0`. Use this variation to understand expected acceptance behavior under normal sampling.
  </Accordion>
</AccordionGroup>

## Output artifacts

A completed custom speculator job produces the following artifact structure:

```text theme={null}
checkpoint/
  model.safetensors  # speculator weights
  config.json        # speculator configuration
```

To download the resulting artifacts, see [Download checkpoints and model files](https://docs.tokenfactory.nebius.com/post-training/how-to-fine-tune#7-download-checkpoints-and-model-files).
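After downloading, a quick local check can confirm that both files from the structure above are present. This is a minimal sketch: `missing_checkpoint_files` is a hypothetical helper, and the directory path is whatever location you downloaded the checkpoint to.

```python
from pathlib import Path

# File names taken from the artifact structure shown above.
EXPECTED_FILES = ("model.safetensors", "config.json")


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

For example, `missing_checkpoint_files("checkpoint")` returns an empty list when both files exist in `checkpoint/`.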

## Serving with vLLM

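Serve the resulting artifact with vLLM by passing it through `--speculative-config`. The base model given to `vllm serve` should be the model the speculator was trained against, and `model` inside the config should point to the downloaded checkpoint directory. The model name and path below are examples: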

```bash theme={null}
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model": "checkpoints/draft_heads/eagle3_custom_v1.0/", "num_speculative_tokens": 3, "method": "eagle3"}'
```

<Warning>
  **Eagle weight tying.** For Eagle-based architectures, weights are tied across heads. Only one head's weights are stored in the artifact, even when the model was trained with multiple decoding heads (e.g. `num_decoding_heads=3`).

  As a result, the speculator can be served with any `num_speculative_tokens` value from `1` to `N`, but it typically performs best when `num_speculative_tokens` matches or is lower than the number of decoding heads used during training.
</Warning>
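Hand-writing the JSON inside shell quotes is easy to get wrong, so it can help to generate the `vllm serve` command shown above programmatically. The sketch below does that using the same three config keys. `build_vllm_command` is a hypothetical helper, and the model name and drafter path are placeholders to replace with your own.

```python
import json
import shlex


def build_vllm_command(base_model, drafter_path, num_speculative_tokens):
    """Return a shell-safe `vllm serve` command string with a speculative config."""
    speculative_config = {
        "model": drafter_path,
        "num_speculative_tokens": num_speculative_tokens,
        "method": "eagle3",
    }
    args = [
        "vllm",
        "serve",
        base_model,
        "--speculative-config",
        json.dumps(speculative_config),
    ]
    # shlex.join quotes each argument so the JSON survives shell parsing.
    return shlex.join(args)


command = build_vllm_command(
    base_model="<base_model>",
    drafter_path="<path_to_checkpoint>",
    num_speculative_tokens=3,
)
```

Printing `command` gives a single line that can be pasted into a shell.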

### Tuning `ntokens`

Set `ntokens` (the `num_speculative_tokens` value in the vLLM config) close to the observed accepted token length (`number_of_tokens`) to balance speculation against compute overhead. For example, if `ntokens=7` yields an accepted length of \~3.86 tokens, dropping to `3` or `4` will typically improve throughput.
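The rule of thumb above can be sketched as a small helper that takes the observed value (the \~3.86 in the example) and returns the nearest integer settings to try. The cap at the number of heads used in training follows the weight-tying note earlier in this section. The function name is hypothetical, and the result is a starting point for benchmarking, not a guaranteed optimum.

```python
import math


def candidate_speculative_tokens(accepted_length, trained_heads):
    """Return the integer settings nearest the observed accepted length."""
    low = max(1, math.floor(accepted_length))
    high = max(1, math.ceil(accepted_length))
    # Cap at the number of decoding heads used during training.
    return sorted({min(low, trained_heads), min(high, trained_heads)})


candidates = candidate_speculative_tokens(accepted_length=3.86, trained_heads=7)
```

With the numbers from the example, `candidates` is `[3, 4]`. Measure throughput at each candidate to pick the final value.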
