custom-speculator - Nebius Token Factory documentation

Use the Custom Speculator API to train a speculative decoding drafter for a supported base model in Token Factory. A custom speculator job produces a drafter artifact that can be served with vLLM for speculative decoding.

Create a custom speculator job

Create a fine-tuning job with method.type="spec-draft" and provide speculator hyperparameters under method.spec_draft.hyperparameters.

import asyncio

from openai import AsyncOpenAI

# ID of a previously uploaded pretokenized JSONL file (placeholder).
training_file = "<training_file_id>"

job_request = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "training_file": training_file,
    "method": {
        "type": "spec-draft",
        "spec_draft": {
            "hyperparameters": {
                "batch_size": 8,
                "learning_rate": 0.0075,
                "n_epochs": 1,
                "warmup_ratio": 0.1,
                "max_grad_norm": 1.0,
                "context_length": 8192,
                "loss": {"type": "kl"},
                "num_decoding_heads": 3,
                "architecture": "eagle3_original",
            },
        },
    },
    "seed": 2214,
    "suffix": "specdec-example",
}


async def main():
    client = AsyncOpenAI(
        base_url="<base_url>",
        api_key="<token>",
    )
    # Submit the speculator training job and print its ID.
    job = await client.fine_tuning.jobs.create(**job_request)
    print(job.id)


asyncio.run(main())

Dataset format

Only pretokenized datasets are supported. Upload a JSONL file where each line contains:

input_ids — array of token IDs.
attention_mask — array of 1s and 0s aligned with input_ids.
labels — array of target token IDs. Use -100 to ignore a position in the loss.

{"input_ids": [1, 14, 52, 16], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 52, 16]}
{"input_ids": [160, 34, 129, 10432], "attention_mask": [1, 1, 1, 1], "labels": [160, 34, 129, 10432]}
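
The record layout above can be produced with a few lines of Python. This is a minimal sketch, assuming the prompt and completion are already tokenized with the base model's tokenizer. The helper names `make_record` and `to_jsonl`, and the choice to mask prompt positions with -100, are illustrative assumptions and not part of the Token Factory API.

```python
import json

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def make_record(prompt_ids, completion_ids):
    """Build one pretokenized training record (hypothetical helper)."""
    input_ids = list(prompt_ids) + list(completion_ids)
    return {
        "input_ids": input_ids,
        # No padding in this sketch, so every position is attended to.
        "attention_mask": [1] * len(input_ids),
        # Assumption: only completion tokens contribute to the loss.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids),
    }


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records) + "\n"


record = make_record([1, 14], [52, 16])
print(to_jsonl([record]), end="")
```

With these inputs the printed line has the same content as the first example record above.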

Request fields

Top-level fields

Field	Type	Required	Description
`model`	string	Yes	Base model to train the speculator against.
`training_file`	string	Yes	Training dataset file ID.
`method`	object	Yes	Fine-tuning method configuration. Use `type="spec-draft"`.
`seed`	integer	No	Random seed for reproducibility.
`suffix`	string	No	Custom suffix for the resulting job or artifact name.

Method fields

Field	Type	Required	Description
`method.type`	string	Yes	Must be `spec-draft`.
`method.spec_draft.hyperparameters`	object	Yes	Speculator training hyperparameters.

Hyperparameters

Field	Type	Description
`batch_size`	integer	Training batch size. A large batch size (e.g. `32`) is recommended.
`learning_rate`	number	Optimizer learning rate. `0.0075` is recommended for models under 100B parameters; use a lower value for larger models.
`n_epochs`	integer	Number of training epochs.
`warmup_ratio`	number	Fraction of total steps used for learning-rate warmup.
`max_grad_norm`	number	Gradient clipping threshold.
`context_length`	integer	Maximum sequence length used during training.
`num_decoding_heads`	integer	Number of speculative decoding heads to train. Supported values: `1`–`7`.
`architecture`	string	Drafter architecture. Supported values: `eagle3_original`, `eagle3`.
`loss.type`	string	Speculator training loss. Supported values: `kl`, `lk_alpha`, `lk_hybrid`.
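
A client-side check can catch unsupported values before a job is submitted. The sketch below encodes only the constraints listed in the table above. The function `check_hyperparameters` is a hypothetical helper, and it only checks fields that are present, because this page does not state which hyperparameters are required or what their defaults are.

```python
SUPPORTED_ARCHITECTURES = {"eagle3", "eagle3_original"}
SUPPORTED_LOSSES = {"kl", "lk_alpha", "lk_hybrid"}


def check_hyperparameters(hp):
    """Return a list of problems found in a hyperparameters dict (empty if none)."""
    problems = []
    if "num_decoding_heads" in hp:
        heads = hp["num_decoding_heads"]
        if not isinstance(heads, int) or not 1 <= heads <= 7:
            problems.append("num_decoding_heads must be an integer from 1 to 7")
    if "architecture" in hp and hp["architecture"] not in SUPPORTED_ARCHITECTURES:
        problems.append("architecture must be eagle3 or eagle3_original")
    if "loss" in hp and hp["loss"].get("type") not in SUPPORTED_LOSSES:
        problems.append("loss.type must be kl, lk_alpha, or lk_hybrid")
    return problems


example = {
    "num_decoding_heads": 3,
    "architecture": "eagle3_original",
    "loss": {"type": "kl"},
}
print(check_hyperparameters(example))
```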

Supported architectures

Architecture	Description
`eagle3`	Token Factory-optimized variant. Can only be served on the Token Factory platform.
`eagle3_original`	Reference implementation. Portable to any vLLM-supported platform.

Supported losses

Set the training loss with loss.type. Supported values:

kl — Standard Kullback-Leibler divergence loss between drafter and target model logits.
lk_alpha — LK loss with KL disabled.
lk_hybrid — Weighted combination of KL and LK losses.
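
To make the `kl` option concrete, here is a minimal sketch of a KL divergence between target and drafter distributions at a single token position, computed from raw logits. The direction KL(target || drafter) and the per-position formulation are assumptions; this page does not specify how the training loss is reduced or weighted, and the LK losses are not defined here, so they are not sketched.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(target_logits, drafter_logits):
    """KL(p || q) for one position, with p from the target and q from the drafter."""
    log_p = log_softmax(target_logits)
    log_q = log_softmax(drafter_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over a three-token vocabulary (made-up values).
print(kl_divergence([2.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

The value is zero when both models produce the same distribution and grows as the drafter's distribution moves away from the target's.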

Training metrics

Custom speculator jobs report two categories of metrics:

Loss metrics for monitoring optimization and model behavior during training.
Sampling performance metrics for estimating speculative decoding quality and efficiency.

Some per-head metrics are reported conditionally. For a given head, the metric is computed under the assumption that all previous speculative tokens were accepted. For example, metrics for head 4 are conditioned on tokens 1 through 3 being accepted.

Loss metrics

The following loss metrics are reported during training:

Metric	Description
`lk` (per head)	LK loss for each decoding head.
`kl` (per head)	Kullback-Leibler divergence loss between drafter and target model logits.
`cross_entropy` (per head)	Cross-entropy loss for each decoding head.
`target_model_loss`	Cross-entropy loss of the target model. Useful for understanding how the target model behaves on the provided dataset.
`draft_loss`	Main speculative decoding training loss reported for the job.

All of the losses above are reported regardless of which loss.type is selected.

Sampling performance metrics

Each decoding head reports the following sampling performance metrics:

Metric	Description
`acceptance_rate`	Expected probability that the speculative token from that head is accepted.
`number_of_tokens`	Expected accepted token length.
`system_efficiency`	Speculative decoding efficiency, measured as `#tokens / #decoding_heads`.
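
The relationship between these metrics can be illustrated with a short sketch. It assumes that each head's `acceptance_rate` is conditional on all earlier heads being accepted, and that the expected accepted length counts drafted tokens only. Both are assumptions: this page does not state how `number_of_tokens` is computed. The rates below are made-up values.

```python
def expected_accepted_tokens(conditional_rates):
    """Expected number of accepted draft tokens from per-head conditional rates."""
    expected = 0.0
    prefix_prob = 1.0  # probability that every head so far was accepted
    for rate in conditional_rates:
        prefix_prob *= rate
        expected += prefix_prob
    return expected


def system_efficiency(num_tokens, num_heads):
    """Efficiency as defined in the table: #tokens / #decoding_heads."""
    return num_tokens / num_heads


rates = [0.8, 0.7, 0.6]  # hypothetical per-head conditional acceptance rates
tokens = expected_accepted_tokens(rates)
efficiency = system_efficiency(tokens, len(rates))
print(tokens, efficiency)
```

Because each term is a product of all earlier rates, later heads contribute less and less to the total, which is why efficiency tends to fall as heads are added.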

Metric variations

Sampling metrics may be reported for multiple evaluation variations:

default

Standard sampling with temperature=1.0. Use this variation to understand expected acceptance behavior under normal sampling.

Output artifacts

A completed custom speculator job produces the following artifact structure:

checkpoint/
  model.safetensors  # speculator weights
  config.json        # speculator configuration

To download the resulting artifacts, see Download checkpoints and model files.

Serving with vLLM

Serve the resulting artifact with vLLM by passing it through --speculative-config:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model": "checkpoints/draft_heads/eagle3_custom_v1.0/", "num_speculative_tokens": 3, "method": "eagle3"}'

Eagle weight tying. For Eagle-based architectures, weights are tied across heads. Only one head's weights are stored in the artifact, even when the model was trained with multiple decoding heads (e.g. num_decoding_heads=3). As a result, the speculator can be served with any num_speculative_tokens value from 1 to N, but it typically performs best when num_speculative_tokens matches or is lower than the number of decoding heads used during training.
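
When the number of speculative tokens is varied between runs, it can help to generate the `--speculative-config` argument instead of editing the JSON by hand. The sketch below uses only the three keys shown in the command above. The helper `speculative_config_arg` is hypothetical and not part of vLLM or Token Factory.

```python
import json
import shlex


def speculative_config_arg(draft_path, num_speculative_tokens, method="eagle3"):
    """Build a shell-quoted --speculative-config argument (hypothetical helper)."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    config = {
        "model": draft_path,
        "num_speculative_tokens": num_speculative_tokens,
        "method": method,
    }
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return "--speculative-config " + shlex.quote(json.dumps(config))


arg = speculative_config_arg("checkpoints/draft_heads/eagle3_custom_v1.0/", 3)
print(arg)
```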

Tuning `ntokens`

Set ntokens close to the observed number of accepted tokens to balance speculation against compute overhead. For example, if ntokens=7 yields about 3.86 accepted tokens per step, dropping to 3 or 4 will typically improve throughput.
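
The rule of thumb above can be written as a small starting-point heuristic. This is a sketch under assumptions: the function `suggest_ntokens` is hypothetical, rounding to the nearest integer is one reasonable choice among several, and the upper clamp follows the earlier note that serving works best at or below the number of trained decoding heads. The result should still be confirmed with a throughput benchmark.

```python
def suggest_ntokens(observed_value, trained_heads):
    """Round the observed value from the example above and clamp it to [1, trained_heads]."""
    return max(1, min(trained_heads, round(observed_value)))


# The example from the text: 7 trained heads, observed value of about 3.86.
print(suggest_ntokens(3.86, 7))
```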
