Use the Custom Speculator API to train a speculative decoding drafter for a supported base model in Token Factory. A custom speculator job produces a drafter artifact that can be served with vLLM for speculative decoding.

Create a custom speculator job

Create a fine-tuning job with method.type="spec-draft" and provide speculator hyperparameters under method.spec_draft.hyperparameters.
import asyncio

from openai import AsyncOpenAI

# File ID of a previously uploaded pretokenized dataset (placeholder).
training_file = "<training_file_id>"

job_request = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "training_file": training_file,
    "method": {
        "type": "spec-draft",
        "spec_draft": {
            "hyperparameters": {
                "batch_size": 8,
                "learning_rate": 0.0075,
                "n_epochs": 1,
                "warmup_ratio": 0.1,
                "max_grad_norm": 1.0,
                "context_length": 8192,
                "loss": {"type": "kl"},
                "num_decoding_heads": 3,
                "architecture": "eagle3_original",
            },
        },
    },
    "seed": 2214,
    "suffix": "specdec-example",
}


async def main() -> None:
    client = AsyncOpenAI(
        base_url="<base_url>",
        api_key="<token>",
    )
    # Submit the custom speculator job and print its ID for later lookups.
    job = await client.fine_tuning.jobs.create(**job_request)
    print(job.id)


asyncio.run(main())
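
After the job is created, it usually needs to be polled until it finishes. The sketch below is one way to do that. It assumes the endpoint mirrors the OpenAI fine-tuning API, so that `client.fine_tuning.jobs.retrieve` is available and terminal statuses are named `succeeded`, `failed`, and `cancelled`; neither is confirmed on this page. `wait_for_job` and `TERMINAL_STATUSES` are hypothetical names introduced for illustration.

```python
import asyncio

# Assumed terminal states, mirroring the OpenAI fine-tuning API.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


async def wait_for_job(client, job_id, poll_seconds=30.0):
    """Poll a fine-tuning job until it reaches an assumed terminal status."""
    while True:
        job = await client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        # Sleep between polls to avoid hammering the API.
        await asyncio.sleep(poll_seconds)
```

With the `client` and `job` objects from the example above, this could be called as `finished = await wait_for_job(client, job.id)`.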

Dataset format

Only pretokenized datasets are supported. Upload a JSONL file where each line contains:
  • input_ids — array of token IDs.
  • attention_mask — array of 1s and 0s aligned with input_ids.
  • labels — array of target token IDs. Use -100 to ignore a position in the loss.
{"input_ids": [1, 14, 52, 16], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 52, 16]}
{"input_ids": [160, 34, 129, 10432], "attention_mask": [1, 1, 1, 1], "labels": [160, 34, 129, 10432]}
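
As a minimal sketch of how such records could be assembled, the helper below concatenates already-tokenized prompt and completion IDs and masks the prompt positions with -100. It assumes you want the loss computed on completion tokens only, and it does not run a tokenizer; `build_record` and `to_jsonl` are hypothetical names, not part of the API.

```python
import json


def build_record(prompt_ids, completion_ids):
    """Build one pretokenized training record.

    Prompt positions get label -100 so the loss ignores them;
    completion positions keep their token IDs as targets.
    """
    input_ids = list(prompt_ids) + list(completion_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": [-100] * len(prompt_ids) + list(completion_ids),
    }


def to_jsonl(records):
    """Serialize records as JSONL text, one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


record = build_record([1, 14], [52, 16])
jsonl_text = to_jsonl([record])
```

Here `record` reproduces the first example line above; writing `jsonl_text` to a file gives a dataset in the expected format.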

Request fields

Top-level fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Base model to train the speculator against. |
| training_file | string | Yes | Training dataset file ID. |
| method | object | Yes | Fine-tuning method configuration. Use type="spec-draft". |
| seed | integer | No | Random seed for reproducibility. |
| suffix | string | No | Custom suffix for the resulting job or artifact name. |

Method fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| method.type | string | Yes | Must be spec-draft. |
| method.spec_draft.hyperparameters | object | Yes | Speculator training hyperparameters. |

Hyperparameters

| Field | Type | Description |
| --- | --- | --- |
| batch_size | integer | Training batch size. A large batch size (e.g. 32) is recommended. |
| learning_rate | number | Optimizer learning rate. 0.0075 is recommended for models under 100B parameters; use a lower value for larger models. |
| n_epochs | integer | Number of training epochs. |
| warmup_ratio | number | Fraction of total steps used for learning-rate warmup. |
| max_grad_norm | number | Gradient clipping threshold. |
| context_length | integer | Maximum sequence length used during training. |
| num_decoding_heads | integer | Number of speculative decoding heads to train. Supported values: 1-7. |
| architecture | string | Drafter architecture. Supported values: eagle3_original, eagle3. |
| loss.type | string | Speculator training loss. Supported values: kl, lk_alpha, lk_hybrid. |

Supported architectures

| Architecture | Description |
| --- | --- |
| eagle3 | TokenFactory-optimized variant. Can only be served on the TokenFactory platform. |
| eagle3_original | Reference implementation. Portable to any vLLM-supported platform. |
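
Because an unsupported value is only rejected once the request reaches the API, a small client-side check can catch typos earlier. The sketch below validates just the architecture and loss.type fields against the values listed on this page; `validate_hyperparameters` is a hypothetical helper, and the server remains the source of truth.

```python
# Supported values as listed in this documentation.
SUPPORTED_ARCHITECTURES = {"eagle3", "eagle3_original"}
SUPPORTED_LOSSES = {"kl", "lk_alpha", "lk_hybrid"}


def validate_hyperparameters(hp):
    """Return a list of problems found in a hyperparameters dict."""
    problems = []
    arch = hp.get("architecture")
    if arch not in SUPPORTED_ARCHITECTURES:
        problems.append(f"unsupported architecture: {arch!r}")
    loss_type = hp.get("loss", {}).get("type")
    if loss_type not in SUPPORTED_LOSSES:
        problems.append(f"unsupported loss.type: {loss_type!r}")
    return problems
```

An empty list means both fields use a documented value.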

Supported losses

Set the training loss with loss.type. Supported values:
  • kl — Standard Kullback-Leibler divergence loss between drafter and target model logits.
  • lk_alpha — LK loss with KL disabled.
  • lk_hybrid — Weighted combination of KL and LK losses.
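
To make the kl option concrete, the sketch below computes a KL divergence between the distributions implied by target and drafter logits at a single position. The direction KL(target || draft) and the absence of any temperature or reduction over positions are assumptions made for illustration; this page does not state how the platform computes the loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(target_logits, draft_logits):
    """KL(target || draft) for one position, in nats (assumed direction)."""
    log_p = log_softmax(target_logits)
    log_q = log_softmax(draft_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The value is zero when the drafter reproduces the target distribution exactly and grows as the two distributions diverge.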

Training metrics

Custom speculator jobs report two categories of metrics:
  • Loss metrics for monitoring optimization and model behavior during training.
  • Sampling performance metrics for estimating speculative decoding quality and efficiency.
Some per-head metrics are reported conditionally. For a given head, the metric is computed under the assumption that all previous speculative tokens were accepted. For example, metrics for head 4 are conditioned on tokens 1 through 3 being accepted.

Loss metrics

The following loss metrics are reported during training:
| Metric | Description |
| --- | --- |
| lk (per head) | LK loss for each decoding head. |
| kl (per head) | Kullback-Leibler divergence loss between drafter and target model logits. |
| cross_entropy (per head) | Cross-entropy loss for each decoding head. |
| target_model_loss | Cross-entropy loss of the target model. Useful for understanding how the target model behaves on the provided dataset. |
| draft_loss | Main speculative decoding training loss reported for the job. |
All of the losses above are reported regardless of which loss.type is selected.

Sampling performance metrics

Each decoding head reports the following sampling performance metrics:
| Metric | Description |
| --- | --- |
| acceptance_rate | Expected probability that the speculative token from that head is accepted. |
| number_of_tokens | Expected accepted token length. |
| system_efficiency | Speculative decoding efficiency, measured as #tokens / #decoding_heads. |
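
One way to see how these metrics relate is sketched below. It assumes each head's acceptance_rate is conditional on all earlier heads being accepted (as described under Training metrics), so the chance of reaching head k is the product of the rates up to k, and the expected accepted length is the sum of those products. Whether the platform's number_of_tokens also counts the extra token produced by the target model is not stated here, so this sketch counts draft tokens only. The function names are hypothetical.

```python
def expected_accepted_tokens(conditional_rates):
    """Expected number of accepted draft tokens.

    conditional_rates[k] is the acceptance probability of head k+1,
    given that every earlier head's token was accepted.
    """
    total = 0.0
    reach = 1.0  # probability that all tokens up to this head are accepted
    for rate in conditional_rates:
        reach *= rate
        total += reach
    return total


def system_efficiency(tokens, num_heads):
    """Efficiency as #tokens / #decoding_heads."""
    return tokens / num_heads


rates = [0.8, 0.5, 0.5]
tokens = expected_accepted_tokens(rates)
efficiency = system_efficiency(tokens, len(rates))
```

For these illustrative rates, the per-head reach probabilities are 0.8, 0.4, and 0.2, so the later heads contribute progressively less to the expected length.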

Metric variations

Sampling metrics may be reported for multiple evaluation variations:
Standard sampling with temperature=1.0. Use this variation to understand expected acceptance behavior under normal sampling.

Output artifacts

A completed custom speculator job produces the following artifact structure:
checkpoint/
  model.safetensors  # speculator weights
  config.json        # speculator configuration
To download the resulting artifacts, see Download checkpoints and model files.

Serving with vLLM

Serve the resulting artifact with vLLM by passing it through --speculative-config:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model": "checkpoints/draft_heads/eagle3_custom_v1.0/", "num_speculative_tokens": 3, "method": "eagle3"}'
Eagle weight tying. For Eagle-based architectures, weights are tied across heads. Only one head’s weights are stored in the artifact, even when the model was trained with multiple decoding heads (e.g. num_decoding_heads=3). As a result, the speculator can be served with any num_speculative_tokens value from 1 to N, but it typically performs best when num_speculative_tokens matches or is lower than the number of decoding heads used during training.
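
The JSON passed to --speculative-config is easy to get wrong when quoting it by hand. The sketch below builds the argument programmatically and, as a design choice rather than a vLLM requirement, caps num_speculative_tokens at the number of heads used in training, following the guidance above. `speculative_config_arg` and its parameters are hypothetical; only the three JSON keys come from the command shown above.

```python
import json
import shlex


def speculative_config_arg(draft_path, num_speculative_tokens, trained_heads,
                           method="eagle3"):
    """Build a shell-safe --speculative-config argument for `vllm serve`."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    # Cap at the number of trained heads (a choice made in this sketch).
    tokens = min(num_speculative_tokens, trained_heads)
    config = {
        "model": draft_path,
        "num_speculative_tokens": tokens,
        "method": method,
    }
    return "--speculative-config " + shlex.quote(json.dumps(config))


arg = speculative_config_arg("checkpoint/", 5, trained_heads=3)
```

The resulting string can be appended to a `vllm serve <base_model>` command line; here the requested value of 5 is reduced to 3.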

Tuning ntokens

Set ntokens close to the observed mean number of accepted tokens (the number_of_tokens metric) to balance speculation depth against compute overhead. For example, if ntokens=7 yields about 3.86 accepted tokens per step, dropping to 3 or 4 will typically improve throughput.
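
As a rough starting point for that tuning, the sketch below rounds the observed mean accepted length up and clamps it to a valid range. It assumes ntokens refers to the same setting as num_speculative_tokens in the vLLM configuration, which this page does not state explicitly; `suggest_ntokens` is a hypothetical helper, and the suggestion should still be checked against measured throughput.

```python
import math


def suggest_ntokens(mean_accepted_tokens, max_tokens):
    """Suggest a speculation length near the observed mean accepted length."""
    candidate = math.ceil(mean_accepted_tokens)
    # Keep the suggestion within [1, max_tokens].
    return max(1, min(candidate, max_tokens))


suggestion = suggest_ntokens(3.86, max_tokens=7)
```

For the example above, this suggests 4; benchmarking 3 and 4 side by side would show which gives better throughput for a given workload.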