Use the Custom Speculator API to train a speculative decoding drafter for a supported base model in Token Factory. A custom speculator job produces a drafter artifact that can be served with vLLM for speculative decoding.
Create a custom speculator job
Create a fine-tuning job with method.type="spec-draft" and provide speculator hyperparameters under method.spec_draft.hyperparameters.
import asyncio

from openai import AsyncOpenAI

# Placeholder: ID of an uploaded pretokenized JSONL training file.
training_file = "<training_file_id>"

job_request = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "training_file": training_file,
    "method": {
        "type": "spec-draft",
        "spec_draft": {
            "hyperparameters": {
                "batch_size": 8,
                "learning_rate": 0.0075,
                "n_epochs": 1,
                "warmup_ratio": 0.1,
                "max_grad_norm": 1.0,
                "context_length": 8192,
                "loss": {"type": "kl"},
                "num_decoding_heads": 3,
                "architecture": "eagle3_original",
            },
        },
    },
    "seed": 2214,
    "suffix": "specdec-example",
}


async def main():
    # Replace <base_url> and <token> with your endpoint and API key.
    client = AsyncOpenAI(base_url="<base_url>", api_key="<token>")
    return await client.fine_tuning.jobs.create(**job_request)


job = asyncio.run(main())
Only pretokenized datasets are supported. Upload a JSONL file where each line contains:
input_ids — array of token IDs.
attention_mask — array of 1s and 0s aligned with input_ids.
labels — array of target token IDs. Use -100 to ignore a position in the loss.
{"input_ids": [1, 14, 52, 16], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 52, 16]}
{"input_ids": [160, 34, 129, 10432], "attention_mask": [1, 1, 1, 1], "labels": [160, 34, 129, 10432]}
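The record layout above can be produced with a few lines of Python. This is a minimal sketch, not the platform's own tooling: `build_record` and `write_jsonl` are hypothetical helper names, the token IDs are assumed to come from the base model's tokenizer, and masking the prompt positions with -100 is an assumption about how you want the loss applied.

```python
import json

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def build_record(prompt_ids, completion_ids):
    """Build one training record; prompt positions are masked out of the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids),
    }


def write_jsonl(records, path):
    """Write records as JSONL: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # Assumption: all three arrays share the same length, as in the samples.
            n = len(record["input_ids"])
            assert len(record["attention_mask"]) == n == len(record["labels"])
            f.write(json.dumps(record) + "\n")


record = build_record(prompt_ids=[1, 14], completion_ids=[52, 16])
print(json.dumps(record))
```

With these inputs the printed record has the same content as the first sample line. To train on every position instead, pass an empty `prompt_ids` list so that no label is masked.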
Request fields
Top-level fields
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Base model to train the speculator against. |
| training_file | string | Yes | Training dataset file ID. |
| method | object | Yes | Fine-tuning method configuration. Use type="spec-draft". |
| seed | integer | No | Random seed for reproducibility. |
| suffix | string | No | Custom suffix for the resulting job or artifact name. |
Method fields
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| method.type | string | Yes | Must be spec-draft. |
| method.spec_draft.hyperparameters | object | Yes | Speculator training hyperparameters. |
Hyperparameters
| Field | Type | Description |
| --- | --- | --- |
| batch_size | integer | Training batch size. A large batch size (e.g. 32) is recommended. |
| learning_rate | number | Optimizer learning rate. 0.0075 is recommended for models under 100B parameters; use a lower value for larger models. |
| n_epochs | integer | Number of training epochs. |
| warmup_ratio | number | Fraction of total steps used for learning-rate warmup. |
| max_grad_norm | number | Gradient clipping threshold. |
| context_length | integer | Maximum sequence length used during training. |
| num_decoding_heads | integer | Number of speculative decoding heads to train. Supported values: 1–7. |
| architecture | string | Drafter architecture. Supported values: eagle3, eagle3_original. |
| loss.type | string | Speculator training loss. Supported values: kl, lk_alpha, lk_hybrid. |
Supported architectures
| Architecture | Description |
| --- | --- |
| eagle3 | Token Factory-optimized variant. Can only be served on the Token Factory platform. |
| eagle3_original | Reference implementation. Portable to any vLLM-supported platform. |
Supported losses
Set the training loss with loss.type. Supported values:
kl — Standard Kullback-Leibler divergence loss between drafter and target model logits.
lk_alpha — LK loss with KL disabled.
lk_hybrid — Weighted combination of KL and LK losses.
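The documented value constraints can be checked locally before a job is submitted. The sketch below is illustrative only: `check_hyperparameters` is a hypothetical helper, not part of any SDK, and it skips fields that are absent because this page does not state which hyperparameters are required.

```python
SUPPORTED_ARCHITECTURES = {"eagle3", "eagle3_original"}
SUPPORTED_LOSSES = {"kl", "lk_alpha", "lk_hybrid"}


def check_hyperparameters(hp):
    """Return a list of problems found in a spec-draft hyperparameters dict."""
    problems = []

    heads = hp.get("num_decoding_heads")
    if heads is not None and not (isinstance(heads, int) and 1 <= heads <= 7):
        problems.append("num_decoding_heads must be an integer from 1 to 7")

    arch = hp.get("architecture")
    if arch is not None and arch not in SUPPORTED_ARCHITECTURES:
        problems.append(f"unsupported architecture: {arch!r}")

    loss_type = hp.get("loss", {}).get("type")
    if loss_type is not None and loss_type not in SUPPORTED_LOSSES:
        problems.append(f"unsupported loss.type: {loss_type!r}")

    return problems


example = {
    "num_decoding_heads": 3,
    "architecture": "eagle3_original",
    "loss": {"type": "kl"},
}
print(check_hyperparameters(example))
```

An empty list means none of the documented constraints were violated; the service may still apply further validation of its own.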
Training metrics
Custom speculator jobs report two categories of metrics:
Loss metrics for monitoring optimization and model behavior during training.
Sampling performance metrics for estimating speculative decoding quality and efficiency.
Some per-head metrics are reported conditionally. For a given head, the metric is computed under the assumption that all previous speculative tokens were accepted. For example, metrics for head 4 are conditioned on tokens 1 through 3 being accepted.
Loss metrics
The following loss metrics are reported during training:
| Metric | Description |
| --- | --- |
| lk (per head) | LK loss for each decoding head. |
| kl (per head) | Kullback-Leibler divergence loss between drafter and target model logits. |
| cross_entropy (per head) | Cross-entropy loss for each decoding head. |
| target_model_loss | Cross-entropy loss of the target model. Useful for understanding how the target model behaves on the provided dataset. |
| draft_loss | Main speculative decoding training loss reported for the job. |
All of the losses above are reported regardless of which loss.type is selected.
Each decoding head reports the following sampling performance metrics:
| Metric | Description |
| --- | --- |
| acceptance_rate | Expected probability that the speculative token from that head is accepted. |
| number_of_tokens | Expected accepted token length. |
| system_efficiency | Speculative decoding efficiency, measured as #tokens / #decoding_heads. |
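To see how per-head conditional rates relate to an accepted length, the following sketch chains them together. It rests on assumptions not confirmed by this page: that each head's acceptance_rate is a probability conditioned on all earlier heads being accepted, and that the accepted length counts draft tokens only. The platform's exact formulas may differ, and the rates used here are made up.

```python
def expected_accepted_tokens(conditional_rates):
    """Sum over heads of P(heads 1..k are all accepted)."""
    expected = 0.0
    reach = 1.0  # probability that every head so far was accepted
    for rate in conditional_rates:
        reach *= rate
        expected += reach
    return expected


rates = [0.8, 0.5, 0.5]  # made-up per-head conditional acceptance rates
tokens = expected_accepted_tokens(rates)
efficiency = tokens / len(rates)  # #tokens / #decoding_heads
print(f"tokens={tokens:.2f} efficiency={efficiency:.3f}")
```

Under these assumptions the three heads contribute 0.8, 0.4, and 0.2 expected tokens, so later heads add progressively less even when their conditional rates look healthy.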
Metric variations
Sampling metrics may be reported for multiple evaluation variations:
Standard sampling with temperature=1.0. Use this variation to understand expected acceptance behavior under normal sampling.
Output artifacts
A completed custom speculator job produces the following artifact structure:
checkpoint/
    model.safetensors    # speculator weights
    config.json          # speculator configuration
To download the resulting artifacts, see Download checkpoints and model files.
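After downloading, a quick local check can confirm that both files are present before you point a server at the directory. This is a small sketch; `missing_artifact_files` is a hypothetical helper and the directory name passed to it is whatever path you downloaded to.

```python
from pathlib import Path

EXPECTED_FILES = ("model.safetensors", "config.json")


def missing_artifact_files(checkpoint_dir):
    """Return the expected files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


print(missing_artifact_files("checkpoint"))
```

An empty list means both the weights and the configuration file were found.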
Serving with vLLM
Serve the resulting artifact with vLLM by passing it through --speculative-config:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-config '{"model": "checkpoints/draft_heads/eagle3_custom_v1.0/", "num_speculative_tokens": 3, "method": "eagle3"}'
Eagle weight tying. For Eagle-based architectures, weights are tied across heads, so only one head’s weights are stored in the artifact, even when the model was trained with multiple decoding heads (e.g. num_decoding_heads=3). As a result, the speculator can be served with any num_speculative_tokens value from 1 to N, but it typically performs best when num_speculative_tokens matches or is lower than the number of decoding heads used during training.
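Hand-writing the JSON inside shell quotes is error-prone, so the command can also be assembled programmatically. The sketch below uses only the three --speculative-config keys shown in the command above; `vllm_serve_command` is a hypothetical helper name, and the paths are the example values from this page.

```python
import json
import shlex


def vllm_serve_command(base_model, drafter_path, num_speculative_tokens):
    """Return a shell-quoted `vllm serve` command using --speculative-config."""
    spec_config = {
        "model": drafter_path,
        "num_speculative_tokens": num_speculative_tokens,
        "method": "eagle3",
    }
    args = ["vllm", "serve", base_model, "--speculative-config", json.dumps(spec_config)]
    return shlex.join(args)


command = vllm_serve_command(
    "meta-llama/Llama-3.1-8B-Instruct",
    "checkpoints/draft_heads/eagle3_custom_v1.0/",
    3,
)
print(command)
```

Because the JSON is generated with `json.dumps` and quoted with `shlex.join`, changing num_speculative_tokens between runs does not require editing a quoted string by hand.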
Tuning ntokens
Set ntokens close to the observed number of accepted tokens to balance speculation against compute overhead. For example, if ntokens=7 yields an average of ~3.86 accepted tokens, dropping to 3 or 4 will typically improve throughput.
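The rule of thumb above can be turned into a small helper that proposes values to benchmark. This is a sketch under assumptions: `candidate_ntokens` is a hypothetical name, the input is the measured accepted length, and clamping to the number of trained heads reflects the weight-tying note rather than a hard limit.

```python
import math


def candidate_ntokens(accepted_length, trained_heads):
    """Return the integers bracketing accepted_length, clamped to 1..trained_heads."""
    low = max(1, min(trained_heads, math.floor(accepted_length)))
    high = max(1, min(trained_heads, math.ceil(accepted_length)))
    return sorted({low, high})


print(candidate_ntokens(3.86, trained_heads=7))  # [3, 4]
```

Treat the result as a starting point: measure throughput for each candidate on your own workload before settling on a value.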