# Dataset formats for fine-tuning

To fine-tune a model, you provide a training dataset. Nebius Token Factory accepts dataset files in the [JSON Lines](https://jsonlines.org/) format (`.jsonl`), where each line is one JSON object that holds one training example. Use one of the following dataset types:

* [Conversational](https://docs.tokenfactory.nebius.com/post-training/datasets#conversational-data)
* [Instruction](https://docs.tokenfactory.nebius.com/post-training/datasets#instruction-data)
* [Text](https://docs.tokenfactory.nebius.com/post-training/datasets#text-data)
* [Pretokenized](https://docs.tokenfactory.nebius.com/post-training/datasets#pre-tokenized-data)

The size limit for dataset files is **50 GB** when you upload via Data Lab and **20 GB** when you upload via the Files API.

## **Conversational data**

You can train a model using chat-style data. Each training example is a single conversation, represented as one `messages` array.

We expect the **last message** to be from the `assistant`; it contains the final answer the model should produce for that conversation.

**Basic example:**

```json theme={null}
{
  "messages": [
    { "role": "system", "content": "You are a friendly dinner planner." },
    { "role": "user", "content": "What's for dinner tonight?" },
    { "role": "assistant", "content": "What cuisine do you prefer?" },
    { "role": "user", "content": "I wouldn't mind Italian." },
    { "role": "assistant", "content": "Then let's try a new pasta!" }
  ]
}
```
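
Before uploading, it can help to check each line against the rule above. The following is a minimal sketch, not an official validator: the function name `check_conversation` and the set of accepted roles (taken from the examples on this page) are assumptions.

```python
import json

# Roles that appear in the examples on this page (assumed, not an official list).
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}

def check_conversation(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    record = json.loads(line)
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty array"]
    problems = []
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
    if messages[-1].get("role") != "assistant":
        problems.append("last message must come from the assistant")
    return problems

good = json.dumps({"messages": [
    {"role": "user", "content": "What's for dinner tonight?"},
    {"role": "assistant", "content": "Then let's try a new pasta!"},
]})
bad = json.dumps({"messages": [
    {"role": "user", "content": "What's for dinner tonight?"},
]})

print(check_conversation(good))  # []
print(check_conversation(bad))   # ['last message must come from the assistant']
```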

Examples of conversational datasets:

* [olathepavilion/Conversational-datasets-json](https://huggingface.co/datasets/olathepavilion/Conversational-datasets-json/blob/main/Validation.jsonl)
* [princefreddy/jsonL\_text\_to\_sql](https://huggingface.co/datasets/princefreddy/jsonL_text_to_sql/blob/main/chatbot_interactions.jsonl)

### Tool-augmented conversations

You can also train the model on conversations that use tools (function calling).

Each training example can optionally include a `tools` field that defines the tools available in that conversation. The schema should match what you plan to use at inference time.

```json theme={null}
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a city.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" }
          },
          "required": ["city"]
        }
      }
    }
  ],
  "messages": [
    { "role": "system", "content": "You are a weather assistant." },
    { "role": "user", "content": "What's the weather in Amsterdam?" },

    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call_1",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"city\": \"Amsterdam\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_1",
      "name": "get_weather",
      "content": "{\"temp_c\": 12, \"condition\": \"cloudy\"}"
    },

    {
      "role": "assistant",
      "content": "It's about 12°C and cloudy in Amsterdam right now."
    }
  ]
}
```

Key points:

* `tools` is **optional**. Omit it for pure conversational data.
* You may include **examples with tool calls** to teach the model:
  * when to call a tool,
  * how to format `tool_calls`,
  * and how to turn tool results into a final answer.
* The **final message** in `messages` should always be an `assistant` message with the natural-language answer you want the model to produce (not a tool call).
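
The key points above can be turned into a quick consistency check. This is a sketch under stated assumptions: the helper name `check_tool_example` is hypothetical, and the rules it enforces (every called tool is declared, `arguments` is a JSON-encoded string, every `tool` message answers an earlier call) are inferred from the example on this page rather than from a published schema.

```python
import json

def check_tool_example(record: dict) -> list[str]:
    """Return the problems found in one tool-augmented training example."""
    problems = []
    declared = {tool["function"]["name"] for tool in record.get("tools", [])}
    issued_ids = set()  # ids of tool calls seen so far
    messages = record["messages"]
    for i, message in enumerate(messages):
        role = message.get("role")
        if role == "assistant":
            for call in message.get("tool_calls", []):
                name = call["function"]["name"]
                if name not in declared:
                    problems.append(f"message {i}: tool {name!r} is not declared")
                try:
                    json.loads(call["function"]["arguments"])
                except (TypeError, ValueError):
                    problems.append(f"message {i}: 'arguments' is not a JSON string")
                issued_ids.add(call["id"])
        elif role == "tool" and message.get("tool_call_id") not in issued_ids:
            problems.append(f"message {i}: unknown tool_call_id")
    last = messages[-1]
    if last.get("role") != "assistant" or not last.get("content") or last.get("tool_calls"):
        problems.append("final message must be an assistant message with text content")
    return problems

example = {
    "tools": [{"type": "function", "function": {
        "name": "get_weather",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]},
    }}],
    "messages": [
        {"role": "user", "content": "What's the weather in Amsterdam?"},
        {"role": "assistant", "tool_calls": [{
            "id": "call_1", "type": "function",
            "function": {"name": "get_weather",
                         "arguments": "{\"city\": \"Amsterdam\"}"},
        }]},
        {"role": "tool", "tool_call_id": "call_1", "name": "get_weather",
         "content": "{\"temp_c\": 12, \"condition\": \"cloudy\"}"},
        {"role": "assistant",
         "content": "It's about 12°C and cloudy in Amsterdam right now."},
    ],
}

print(check_tool_example(example))  # []
```

An example that stops at the tool call, without the final natural-language answer, is reported as a problem.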

## **Instruction data**

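You can train a model on prompts and the expected answers to them. Each training example is a single JSON line with a `prompt` field and a `completion` field: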

```json theme={null}
{"prompt": "Capital of Australia", "completion": "Canberra"}
{"prompt": "Distance between Rome and Kuala Lumpur", "completion": "9703 km"}
```

Example of an instruction dataset: [Andzej-75/German\_RisingWorld\_prompt-text-rejected\_Jsonl](https://huggingface.co/datasets/Andzej-75/German_RisingWorld_prompt-text-rejected_Jsonl/blob/main/training_data_Lines.jsonl).
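
If your prompts and answers live in another structure, a few lines of Python are enough to produce this format. The sketch below is illustrative only: the helper name `to_jsonl` is hypothetical, and the pairs are the ones from the example above.

```python
import json

pairs = [
    ("Capital of Australia", "Canberra"),
    ("Distance between Rome and Kuala Lumpur", "9703 km"),
]

def to_jsonl(pairs) -> str:
    """Serialize (prompt, completion) pairs, one JSON object per line."""
    lines = [
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dumps({"prompt": prompt, "completion": completion}, ensure_ascii=False)
        for prompt, completion in pairs
    ]
    return "\n".join(lines) + "\n"

jsonl = to_jsonl(pairs)
print(jsonl, end="")
```

To save the result, write the returned string to a file opened with `encoding="utf-8"`.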

## **Text data**

If you have **unstructured or non-chat data** (tasks, explanations, documents, QA pairs, etc.), you can train using plain text.\
Each training example is a single JSON line with a `text` field.

Example (reasoning-style prompt/answer):

```json theme={null}
{"text": "Given 10 items, find a fair way to divide them among 3 people.\n\n<Answer>: Each person gets 3 items, and 1 item remains. You can rotate who gets the extra item over time to keep it fair."}
{"text": "You are given a list of transactions. Classify whether the account is at risk of fraud.\n\nTransactions: [...]\n\n<Label>: HIGH_RISK"}
```

You can also use this format for longer explanations, documentation-like text, or multi-step solutions:

```json theme={null}
{"text": "Task: Explain how compound interest works to a beginner.\n\nAnswer: Compound interest is interest calculated on the original amount plus any interest that has already been added. This means..."}
```

Example of a text dataset: [FranciscoMacaya/train\_model.jsonl](https://huggingface.co/datasets/FranciscoMacaya/train_model.jsonl/blob/main/train_model.jsonl).
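
Multi-paragraph text must still fit on one physical line of the file, so line breaks inside `text` have to be escaped as `\n`. A JSON serializer does this for you. The sketch below assumes a hypothetical helper `make_text_record` and reuses the `Task:` / `Answer:` layout from the example above; the layout itself is your choice.

```python
import json

def make_text_record(task: str, answer: str) -> str:
    """Build one JSONL line whose `text` field holds a task and its answer."""
    text = f"Task: {task}\n\nAnswer: {answer}"
    # json.dumps escapes the newlines, so the record stays on a single line.
    return json.dumps({"text": text}, ensure_ascii=False)

line = make_text_record(
    "Explain how compound interest works to a beginner.",
    "Compound interest is interest calculated on the original amount "
    "plus any interest that has already been added.",
)
print(line)
```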

## **Pre-tokenized data**

### When to use pretokenized datasets

Use the **Pretokenized** dataset format if you want to:

* Use a **custom chat template** that differs from the default instruct template.
* Define a **chat template for base models** that don’t have one by default.
* Have full control over **tokenization, packing, loss masking, and attention masking**.

You tokenize everything yourself and upload sequences directly.

### Pretokenized dataset schema

Each sample supports the following fields:

* `token_ids: list[int]` – **required**\
  Token IDs for the full sequence.
* `labels: list[int]` – optional\
  Per-token labels for loss computation.
  * Use `-100` at a position to **mask out loss** for that token (no gradient from that position).
* `attention_mask: list[int]` – optional\
  Segment-aware attention mask and packing layout. The mask must start with `1`, and its non-zero values must form a **non-decreasing sequence** of segment IDs; trailing `0`s are allowed as padding. Valid examples:
  * Single sample (no packing):
    ```text theme={null}
    [1, 1, 1, 1, 1, 1, 1]
    ```
  * Multiple segments packed into one sequence:
    ```text theme={null}
    [1, 1, 2, 2, 3, 3, 3]
    ```
    Here, `1`, `2`, `3` indicate different segments. Our attention implementation **prevents cross-segment attention**, so each segment only attends within itself.
  * Packed segments with padding:
    ```text theme={null}
    [1, 1, 1, 2, 2, 3, 3, 3, 0, 0, 0, 0]
    ```
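
The rules for `attention_mask` can be checked locally before upload. This is a minimal sketch with a hypothetical function name; it enforces only what is stated above (starts with `1`, non-zero values never decrease, zeros appear only at the end) and does not assume that segment IDs must be consecutive.

```python
def is_valid_attention_mask(mask: list[int]) -> bool:
    """Check the mask layout: segment ids first, optional padding zeros last."""
    if not mask or mask[0] != 1:
        return False
    previous = 1
    seen_padding = False
    for value in mask:
        if value == 0:
            seen_padding = True
            continue
        if seen_padding:      # a segment id after padding has started
            return False
        if value < previous:  # segment ids must not decrease
            return False
        previous = value
    return True

print(is_valid_attention_mask([1, 1, 2, 2, 3, 3, 3]))                 # True
print(is_valid_attention_mask([1, 1, 1, 2, 2, 3, 3, 3, 0, 0, 0, 0]))  # True
print(is_valid_attention_mask([1, 2, 1]))                             # False
print(is_valid_attention_mask([1, 0, 1]))                             # False
```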

***

### Using `packing` with pretokenized datasets

You can **combine**:

* `packing = true` (hyperparameter), **and**
* a **Pretokenized** dataset.

In that case:

* Each uploaded sample must have an `attention_mask` of **all ones** (the “single sample” pattern):
  ```text theme={null}
  [1, 1, 1, 1, 1, 1, 1]
  ```
* Nebius will perform the **packing for you** based on `context_length`.

> Note: For some models, `packing = false` is not allowed. Those models can only be trained with packing enabled.

### Valid JSONL examples

Each line is one training sample:

```text theme={null}
{"token_ids": [10, 20, 30]}
{"token_ids": [10, 20, 30], "labels": [10, 20, -100]}
{"token_ids": [10, 20, 30], "attention_mask": [1, 1, 1]}
{"token_ids": [10, 20, 30], "labels": [10, 20, -100], "attention_mask": [1, 1, 1]}
```

Constraints to keep in mind:

* `len(token_ids) == len(labels)` (if `labels` provided).
* `len(token_ids) == len(attention_mask)` (if `attention_mask` provided).
* All `token_ids` must be from the **model’s default tokenizer vocabulary** (no custom tokens).
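
A common use of `labels` is to train only on the answer and mask the prompt. The sketch below shows one way to build and check such a sample, using the field names from the examples above. The helper names and the `vocab_size` parameter are assumptions; take the real vocabulary size from the tokenizer of the model you fine-tune. Labels are unshifted copies of the token IDs, mirroring the examples above.

```python
IGNORE_INDEX = -100  # label value that masks a position out of the loss

def build_sample(prompt_ids: list[int], answer_ids: list[int]) -> dict:
    """Concatenate prompt and answer; compute loss on the answer tokens only."""
    token_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return {
        "token_ids": token_ids,
        "labels": labels,
        "attention_mask": [1] * len(token_ids),
    }

def check_sample(sample: dict, vocab_size: int) -> list[str]:
    """Return violations of the length and vocabulary constraints."""
    token_ids = sample.get("token_ids")
    if not token_ids:
        return ["'token_ids' is required and must be non-empty"]
    problems = []
    for field in ("labels", "attention_mask"):
        if field in sample and len(sample[field]) != len(token_ids):
            problems.append(f"len({field}) must equal len(token_ids)")
    if any(not 0 <= token < vocab_size for token in token_ids):
        problems.append("token id outside the tokenizer vocabulary")
    return problems

sample = build_sample(prompt_ids=[10, 20], answer_ids=[30])
print(sample)
# {'token_ids': [10, 20, 30], 'labels': [-100, -100, 30], 'attention_mask': [1, 1, 1]}
print(check_sample(sample, vocab_size=100))  # []
```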

***

## Packing for efficient fine-tuning

When you fine-tune on samples of different lengths, the standard approach is to **pad** all samples in a batch to the longest one. This has two main downsides:

* The **number of real (non-padding) tokens per batch fluctuates** a lot between steps.
* That variability leads to **noisier gradients** and can degrade final model quality.

**Packing** fixes this by combining multiple shorter samples into a single sequence while preventing them from interacting with each other.

For typical Transformer architectures, tokens only interact via **attention**. In our implementation, attention is **masked across segments**, so tokens from one segment never attend to tokens from another. Attention is computed independently per segment, but all segments share the same packed sequence.

This gives you:

* Batches with a **more consistent number of training tokens**
* **More stable gradients** and smoother training dynamics
* Better **compute efficiency**, especially when your dataset has many short examples
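
To make the cross-segment masking concrete, the sketch below derives which positions may attend to which from a list of segment IDs. It illustrates the idea only and is not the platform's implementation: the function name is hypothetical, and the causal (no look-ahead) constraint is an assumption typical of decoder-only models.

```python
def segment_attention(segment_ids: list[int]) -> list[list[bool]]:
    """allowed[q][k] is True when query position q may attend to key position k."""
    n = len(segment_ids)
    return [
        [
            segment_ids[q] != 0                   # padding attends to nothing
            and segment_ids[q] == segment_ids[k]  # same segment only
            and k <= q                            # causal order (assumption)
            for k in range(n)
        ]
        for q in range(n)
    ]

allowed = segment_attention([1, 1, 2, 2, 0])
for row in allowed:
    print("".join("x" if cell else "." for cell in row))
# x....
# xx...
# ..x..
# ..xx.
# .....
```

Each block of `x` marks one segment: the second segment never sees the first, and the padding position takes part in no attention at all.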

When packing is enabled, we:

* Combine multiple samples from your dataset into a single sequence of length `context_length`
* Ensure that **no sample is truncated** (neither on the left nor on the right)

> Note: For some large MoE models, packing is **mandatory** and cannot be disabled.

For more background, see:

> *Efficient LLM Pretraining: Packed Sequences and Masked Attention*

If you want to implement your own packing strategy, you can do it client-side and upload the data in the **Pretokenized** dataset format, which supports segments via `attention_mask`.
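
As a starting point for client-side packing, here is a minimal greedy first-fit sketch. It is one possible strategy, not the one the platform uses. The function name `pack` and the `pad_id` parameter are assumptions; use the padding token ID of your model's tokenizer. Samples are never truncated, and padding positions get the label `-100` so they do not contribute to the loss.

```python
def pack(samples: list[list[int]], context_length: int, pad_id: int = 0) -> list[dict]:
    """Greedy first-fit: put each sample into the first sequence with enough room."""
    bins: list[list[list[int]]] = []
    for sample in samples:
        if len(sample) > context_length:
            raise ValueError("sample is longer than context_length")
        for bin_ in bins:
            if sum(len(s) for s in bin_) + len(sample) <= context_length:
                bin_.append(sample)
                break
        else:  # no existing sequence had room
            bins.append([sample])

    packed = []
    for bin_ in bins:
        token_ids, attention_mask = [], []
        for segment, sample in enumerate(bin_, start=1):
            token_ids += sample
            attention_mask += [segment] * len(sample)  # one id per segment
        padding = context_length - len(token_ids)
        packed.append({
            "token_ids": token_ids + [pad_id] * padding,
            "labels": token_ids + [-100] * padding,    # no loss on padding
            "attention_mask": attention_mask + [0] * padding,
        })
    return packed

packed = pack([[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]], context_length=6)
for row in packed:
    print(row["token_ids"], row["attention_mask"])
# [11, 12, 13, 21, 22, 41] [1, 1, 1, 2, 2, 3]
# [31, 32, 33, 34, 0, 0] [1, 1, 1, 1, 0, 0]
```

Serialize each resulting dictionary as one JSON line to obtain a Pretokenized dataset file.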