.jsonl) for dataset files. The size limit for dataset files is 5 gigabytes.

You can use one of the following dataset types:
Conversational data
You can train a model using chat-style data. Each training example is a single conversation, represented as one `messages` array.
We expect the last message to be from the **assistant**, containing the final answer the model should produce for that conversation.

Basic example:
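A minimal sketch of a single JSONL line, assuming the OpenAI-style `role`/`content` message schema (the conversation content is illustrative):

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}
```

Each line of the `.jsonl` file holds one such conversation.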
Tool-augmented conversations
You can also train the model on conversations that use tools (function calling). Each training example can optionally include a `tools` field that defines the tools available in that conversation. The schema should match what you plan to use at inference time.
- `tools` is optional. Omit it for pure conversational data.
- You may include examples with tool calls to teach the model:
  - when to call a tool,
  - how to format `tool_calls`,
  - and how to turn tool results into a final answer.
- The final message in `messages` should always be an **assistant** message with the natural-language answer you want the model to produce (not a tool call).
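As an illustration (a sketch, assuming the OpenAI-style function-calling schema for `tools` and `tool_calls`; the weather tool is hypothetical):

```json
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }
  ],
  "messages": [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 18, \"conditions\": \"sunny\"}"},
    {"role": "assistant", "content": "It is currently 18 °C and sunny in Paris."}
  ]
}
```

The example is shown pretty-printed for readability; in the actual `.jsonl` file each training example must be a single line.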
Instruction data
You can specify prompts and the expected answers to them.

Text data
If you have unstructured or non-chat data (tasks, explanations, documents, QA pairs, etc.), you can train using plain text. Each training example is a single JSON line with a `text` field.
Example (reasoning-style prompt/answer):
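For instance, one line might look like this (the text itself is illustrative):

```json
{"text": "Question: A train travels 60 km in 45 minutes. What is its average speed?\nAnswer: 45 minutes is 0.75 hours, so the average speed is 60 / 0.75 = 80 km/h."}
```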
Pre-tokenized data
When to use pretokenized datasets
Use the Pretokenized dataset format if you want to:

- Use a custom chat template that differs from the default instruct template.
- Define a chat template for base models that don’t have one by default.
- Have full control over tokenization, packing, loss masking, and attention masking.
Pretokenized dataset schema
Each sample supports the following fields:

- `token_ids: list[int]` – required
  The token IDs for the full sequence.
- `labels: list[int]` – optional
  Per-token labels for loss computation. Use `-100` at a position to mask out loss for that token (no gradient from that position); use the token ID itself at positions that should contribute to the loss.
- `attention_mask: list[int]` – optional
  Segment-aware attention mask and packing layout. The mask must be a non-decreasing sequence starting with `1`, with optional trailing `0`s for padding. Valid examples:
- Single sample (no packing): `[1, 1, 1, 1, 1]`
- Multiple segments packed into one sequence: `[1, 1, 1, 2, 2, 3, 3, 3]`
  Here, `1`, `2`, `3` indicate different segments. Our attention implementation prevents cross-segment attention, so each segment only attends within itself.
- Packed segments with padding: `[1, 1, 1, 2, 2, 3, 3, 0]`
  Trailing `0`s mark padding positions.
- Single sample (no packing), with trailing padding: `[1, 1, 1, 1, 0, 0]`
Using packing with pretokenized datasets
You can combine:

- `packing = true` (hyperparameter), and
- a Pretokenized dataset.

In that case:

- Each uploaded sample must have an `attention_mask` of all ones (the “single sample” pattern).
- Nebius will perform the packing for you based on `context_length`.
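For example, an uploaded sample might look like this (token IDs are illustrative, not from a real tokenizer; `-100` masks the prompt tokens out of the loss):

```json
{"token_ids": [101, 2023, 2003, 1037, 3231, 102], "labels": [-100, -100, 2003, 1037, 3231, 102], "attention_mask": [1, 1, 1, 1, 1, 1]}
```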
Note: For some models, `packing = false` is not allowed; those models can only be trained with packing enabled.
Valid JSONL examples
Each line is one training sample:

- `len(token_ids) == len(labels)` (if `labels` is provided).
- `len(token_ids) == len(attention_mask)` (if `attention_mask` is provided).
- All `token_ids` must be from the model’s default tokenizer vocabulary (no custom tokens).
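For example, the following lines all satisfy these constraints (token IDs are illustrative):

```json
{"token_ids": [5, 12, 7, 9]}
{"token_ids": [5, 12, 7, 9], "labels": [-100, -100, 7, 9]}
{"token_ids": [5, 12, 7, 9], "labels": [-100, -100, 7, 9], "attention_mask": [1, 1, 1, 1]}
```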
Packing for efficient fine-tuning
When you fine-tune on samples of different lengths, the standard approach is to pad all samples in a batch to the longest one. This has two main downsides:

- The number of tokens per batch fluctuates a lot between steps.
- That variability leads to noisier gradients and can degrade final model quality.

Packing avoids this by concatenating several samples into one fixed-length sequence, which gives you:

- Batches with a more consistent number of training tokens
- More stable gradients and smoother training dynamics
- Better compute efficiency, especially when your dataset has many short examples

When packing is enabled, we:

- Combine multiple samples from your dataset into a single sequence of length `context_length`
- Ensure that no sample is truncated (neither on the left nor on the right)
Note: For some large MoE models, packing is mandatory and cannot be disabled.

For more background, see: Efficient LLM Pretraining: Packed Sequences and Masked Attention.

If you want to implement your own packing strategy, you can do it client-side and upload the data in the Pretokenized dataset format, which supports segments via `attention_mask`.
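For example, two short samples packed client-side into one sequence could be uploaded as a single pretokenized line like this (token IDs are illustrative; `-100` masks the first token of each segment and the trailing padding position):

```json
{"token_ids": [5, 12, 7, 9, 4, 8, 6, 0], "labels": [-100, 12, 7, 9, -100, 8, 6, -100], "attention_mask": [1, 1, 1, 1, 2, 2, 2, 0]}
```

The `attention_mask` values `1` and `2` mark the two segments, so neither sample attends to the other, and the trailing `0` marks padding.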