You can create datasets for training a model. Nebius Token Factory supports the JSON Lines format (.jsonl) for dataset files; the size limit for a dataset file is 5 gigabytes. You can use one of the dataset types described below.

Conversational data

You can train a model using chat-style data. Each training example is a single conversation, represented as one messages array. We expect the last message to be from the assistant, containing the final answer the model should produce for that conversation.

Basic example:
{
  "messages": [
    { "role": "system", "content": "You are a friendly dinner planner." },
    { "role": "user", "content": "What's for dinner tonight?" },
    { "role": "assistant", "content": "What cuisine do you prefer?" },
    { "role": "user", "content": "I wouldn't mind Italian." },
    { "role": "assistant", "content": "Then let's try a new pasta!" }
  ]
}
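If you assemble your data programmatically, here is a minimal sketch of writing conversational examples to a .jsonl file with Python's standard json module; the file name and the sample conversation are placeholders:
import json

# Placeholder examples; each item becomes one line in the .jsonl file,
# and each conversation ends with an assistant message.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a friendly dinner planner."},
            {"role": "user", "content": "What's for dinner tonight?"},
            {"role": "assistant", "content": "What cuisine do you prefer?"},
        ]
    },
]

with open("conversations.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        f.write(json.dumps(example, ensure_ascii=False) + "\n")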

Tool-augmented conversations

You can also train the model on conversations that use tools (function calling). Each training example can optionally include a tools field that defines the tools available in that conversation. The schema should match what you plan to use at inference time.
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a city.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" }
          },
          "required": ["city"]
        }
      }
    }
  ],
  "messages": [
    { "role": "system", "content": "You are a weather assistant." },
    { "role": "user", "content": "What's the weather in Amsterdam?" },

    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call_1",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"city\": \"Amsterdam\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_1",
      "name": "get_weather",
      "content": "{\"temp_c\": 12, \"condition\": \"cloudy\"}"
    },

    {
      "role": "assistant",
      "content": "It's about 12°C and cloudy in Amsterdam right now."
    }
  ]
}
Key points:
  • tools is optional. Omit it for pure conversational data.
  • You may include examples with tool calls to teach the model:
    • when to call a tool,
    • how to format tool_calls,
    • and how to turn tool results into a final answer.
  • The final message in messages should always be an assistant message with the natural-language answer you want the model to produce (not a tool call).
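In the example above, the arguments value of a tool call and the content of a tool message are JSON-encoded strings rather than nested objects. As an illustration only, a minimal Python sketch of building one such training line (the tool call, values, and file name are placeholders; tools is omitted here since it is optional):
import json

line = {
    "messages": [
        {"role": "user", "content": "What's the weather in Amsterdam?"},
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        # "arguments" is a JSON string, so it is encoded separately.
                        "arguments": json.dumps({"city": "Amsterdam"}),
                    },
                }
            ],
        },
        {
            "role": "tool",
            "tool_call_id": "call_1",
            "name": "get_weather",
            # The tool result here is also a JSON-encoded string.
            "content": json.dumps({"temp_c": 12, "condition": "cloudy"}),
        },
        {"role": "assistant", "content": "It's about 12°C and cloudy in Amsterdam right now."},
    ]
}

with open("tool_conversations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(line, ensure_ascii=False) + "\n")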

Instruction data

You can specify prompts and the expected answers to them:
{"prompt": "Capital of Australia", "completion": "Canberra"}
{"prompt": "Distance between Rome and Kuala Lumpur", "completion": "9703 km"}
Example of an instruction dataset: Andzej-75/German_RisingWorld_prompt-text-rejected_Jsonl.

Text data

If you have unstructured or non-chat data (tasks, explanations, documents, QA pairs, etc.), you can train using plain text.
Each training example is a single JSON line with a text field.
Example (reasoning-style prompt/answer):
{"text": "Given 10 items, find a fair way to divide them among 3 people.\n\n<Answer>: Each person gets 3 items, and 1 item remains. You can rotate who gets the extra item over time to keep it fair."}
{"text": "You are given a list of transactions. Classify whether the account is at risk of fraud.\n\nTransactions: [...]\n\n<Label>: HIGH_RISK"}
You can also use this format for longer explanations, documentation-like text, or multi-step solutions:
{"text": "Task: Explain how compound interest works to a beginner.\n\nAnswer: Compound interest is interest calculated on the original amount plus any interest that has already been added. This means..."}
Example of a text dataset: FranciscoMacaya/train_model.jsonl.

Pre-tokenized data

When to use pretokenized datasets

Use the Pretokenized dataset format if you want to:
  • Use a custom chat template that differs from the default instruct template.
  • Define a chat template for base models that don’t have one by default.
  • Have full control over tokenization, packing, loss masking, and attention masking.
You tokenize everything yourself and upload sequences directly.

Pretokenized dataset schema

Each sample supports the following fields:
  • token_ids: list[int] – required
    The token IDs for the full sequence.
  • labels: list[int] – optional
    Per-token labels for loss computation.
    • Use -100 at a position to mask out loss for that token (no gradient from that position).
  • attention_mask: list[int] – optional
    Segment-aware attention mask and packing layout.
    The mask must be a non-decreasing sequence starting with 1, with optional trailing 0s for padding. Valid examples:
    • Single sample (no packing):
      [1, 1, 1, 1, 1, 1, 1]
      
    • Multiple segments packed into one sequence:
      [1, 1, 2, 2, 3, 3, 3]
      
      Here, 1, 2, 3 indicate different segments. Our attention implementation prevents cross-segment attention, so each segment only attends within itself.
    • Packed segments with padding:
      [1, 1, 1, 2, 2, 3, 3, 3, 0, 0, 0, 0]
      
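As an illustration, here is a minimal sketch of building one pretokenized sample, assuming a Hugging Face transformers tokenizer; the prompt tokens are masked out of the loss with -100, and the model name, prompt, and completion are placeholders:
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model")  # placeholder model name

prompt = "Capital of Australia\n"
completion = "Canberra"

prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
completion_ids = tokenizer.encode(completion, add_special_tokens=False)

token_ids = prompt_ids + completion_ids
# -100 masks the prompt positions out of the loss; only completion tokens are learned.
labels = [-100] * len(prompt_ids) + completion_ids
# Single sample, no packing: the attention mask is all ones.
attention_mask = [1] * len(token_ids)

sample = {"token_ids": token_ids, "labels": labels, "attention_mask": attention_mask}

with open("pretokenized.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")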

Using packing with pretokenized datasets

You can combine:
  • packing = true (hyperparameter), and
  • a Pretokenized dataset.
In that case:
  • Each uploaded sample must have an attention_mask of all ones (the “single sample” pattern):
    [1, 1, 1, 1, 1, 1, 1]
    
  • Nebius will perform the packing for you based on context_length.
Note: For some models, packing = false is not allowed — those models can only be trained with packing enabled.

Valid JSONL examples

Each line is one training sample:
{"token_ids": [10, 20, 30]}
{"token_ids": [10, 20, 30], "labels": [10, 20, -100]}
{"token_ids": [10, 20, 30], "attention_mask": [1, 1, 1]}
{"token_ids": [10, 20, 30], "labels": [10, 20, -100], "attention_mask": [1, 1, 1]}
Constraints to keep in mind:
  • len(token_ids) == len(labels) (if labels provided).
  • len(token_ids) == len(attention_mask) (if attention_mask provided).
  • All token_ids must be from the model’s default tokenizer vocabulary (no custom tokens).
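A minimal sketch of checking these constraints before upload, assuming a Hugging Face transformers tokenizer and using vocab_size as a simple proxy for the default vocabulary; the file and model names are placeholders:
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model")  # placeholder model name
vocab_size = tokenizer.vocab_size  # proxy for the default vocabulary (no custom tokens)

with open("pretokenized.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        sample = json.loads(line)
        token_ids = sample["token_ids"]

        labels = sample.get("labels")
        if labels is not None:
            assert len(labels) == len(token_ids), f"line {line_no}: labels length mismatch"

        attention_mask = sample.get("attention_mask")
        if attention_mask is not None:
            assert len(attention_mask) == len(token_ids), f"line {line_no}: attention_mask length mismatch"

        # All token IDs must come from the model's default tokenizer vocabulary.
        assert all(0 <= t < vocab_size for t in token_ids), f"line {line_no}: token ID out of vocabulary"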

Packing for efficient fine-tuning

When you fine-tune on samples of different lengths, the standard approach is to pad all samples in a batch to the longest one. This has two main downsides:
  • The number of tokens per batch fluctuates a lot between steps.
  • That variability leads to noisier gradients and can degrade final model quality.
Packing fixes this by combining multiple shorter samples into a single sequence while preventing them from interacting with each other. For typical Transformer architectures, tokens only interact via attention. In our implementation, attention is masked across segments, so tokens from one segment never attend to tokens from another. Attention is computed independently per segment, but all segments share the same packed sequence. This gives you:
  • Batches with a more consistent number of training tokens
  • More stable gradients and smoother training dynamics
  • Better compute efficiency, especially when your dataset has many short examples
When packing is enabled, we:
  • Combine multiple samples from your dataset into a single sequence of length context_length
  • Ensure that no sample is truncated (neither on the left nor on the right)
Note: For some large MoE models, packing is mandatory and cannot be disabled.
For more background, see:
Efficient LLM Pretraining: Packed Sequences and Masked Attention
If you want to implement your own packing strategy, you can do it client-side and upload the data in the Pretokenized dataset format, which supports segments via attention_mask.
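As a rough illustration, a greedy client-side packing pass might look like the sketch below; the context length, padding token ID, and sample data are placeholders, and real packing strategies can be more sophisticated:
import json

CONTEXT_LENGTH = 8   # placeholder; use your job's context_length
PAD_TOKEN_ID = 0     # placeholder; use your tokenizer's padding token ID

def pack(samples):
    """Greedily pack tokenized samples into fixed-length sequences without truncation."""
    packed, token_ids, labels, mask, segment = [], [], [], [], 0
    for sample in samples:
        ids = sample["token_ids"]
        lbls = sample.get("labels", ids)
        if len(ids) > CONTEXT_LENGTH:
            continue  # a sample longer than the context cannot be packed without truncation
        if len(token_ids) + len(ids) > CONTEXT_LENGTH:
            # Flush the current sequence, padding the tail with 0s in the attention mask.
            pad = CONTEXT_LENGTH - len(token_ids)
            packed.append({
                "token_ids": token_ids + [PAD_TOKEN_ID] * pad,
                "labels": labels + [-100] * pad,
                "attention_mask": mask + [0] * pad,
            })
            token_ids, labels, mask, segment = [], [], [], 0
        segment += 1
        token_ids += ids
        labels += lbls
        mask += [segment] * len(ids)  # 1, 2, 3, ... mark the segments
    if token_ids:
        pad = CONTEXT_LENGTH - len(token_ids)
        packed.append({
            "token_ids": token_ids + [PAD_TOKEN_ID] * pad,
            "labels": labels + [-100] * pad,
            "attention_mask": mask + [0] * pad,
        })
    return packed

samples = [{"token_ids": [10, 20, 30]}, {"token_ids": [40, 50]}, {"token_ids": [60, 70, 80]}]
with open("packed.jsonl", "w", encoding="utf-8") as f:
    for row in pack(samples):
        f.write(json.dumps(row) + "\n")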