How to create a dataset for fine-tuning

You can create datasets for training a model and for validating the results of the training. Nebius Token Factory supports the JSON Lines format (.jsonl) for dataset files. You can use one of the following dataset types: conversational data, instruction data, or text data.

The size limit for dataset files is 5 gigabytes.
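Before uploading, it can help to check a file locally. The following is a minimal sketch, not part of Nebius Token Factory: the function name check_jsonl is hypothetical, and it assumes that "5 gigabytes" means 5 * 10**9 bytes, which the service may count differently. It only checks the file size and that every line is a standalone JSON object.

```python
import json
import os

# Assumption: "5 gigabytes" is read as 5 * 10**9 bytes; the service may count differently.
MAX_BYTES = 5 * 10**9


def check_jsonl(path, max_bytes=MAX_BYTES):
    """Return a list of problems found in a .jsonl file (an empty list means none)."""
    problems = []
    if os.path.getsize(path) > max_bytes:
        problems.append(f"file exceeds {max_bytes} bytes")
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append(f"line {number}: empty line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append(f"line {number}: invalid JSON ({err.msg})")
                continue
            # Every dataset type shown on this page uses one JSON object per line.
            if not isinstance(record, dict):
                problems.append(f"line {number}: expected a JSON object")
    return problems
```

Calling check_jsonl("train.jsonl") on a well-formed file returns an empty list; the file name here is only a placeholder.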

Conversational data

You can train a model by using chats. Pass each chat as a single line that contains a messages array. The following example shows one such record, formatted across several lines for readability; in the .jsonl file, each record occupies one line:

{
  "messages": [
    {"role": "system", "content": "This is a system prompt."},
    {"role": "user", "content": "What's for dinner tonight?"},
    {"role": "assistant", "content": "What cuisine do you prefer?"},
    {"role": "user", "content": "I wouldn't mind Italian."},
    {"role": "assistant", "content": "Then let's try a new pasta!"}
  ]
}
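Because JSON Lines expects one record per line, a chat like the one above has to be serialized without indentation. The following sketch shows one way to do that with Python's standard json module; the helper name chat_to_line is hypothetical and not part of any Nebius tooling.

```python
import json


def chat_to_line(messages):
    """Serialize one chat as a single JSON Lines record."""
    # json.dumps without the indent argument emits the whole object on one line.
    return json.dumps({"messages": messages}, ensure_ascii=False)


chat = [
    {"role": "system", "content": "This is a system prompt."},
    {"role": "user", "content": "What's for dinner tonight?"},
    {"role": "assistant", "content": "What cuisine do you prefer?"},
    {"role": "user", "content": "I wouldn't mind Italian."},
    {"role": "assistant", "content": "Then let's try a new pasta!"},
]

line = chat_to_line(chat)
print(line)
```

Writing one such line per chat, each followed by a newline character, produces a conversational dataset file.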

Examples of conversational datasets:

Instruction data

You can specify prompts and the expected answers to them, with one prompt and completion pair per line:

{"prompt": "Capital of Australia", "completion": "Canberra"}
{"prompt": "Distance between Rome and Kuala Lumpur", "completion": "9703 km"}
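A record that lacks either field is easy to overlook in a large file. The following sketch scans JSON Lines text and reports such records; the function name missing_fields is hypothetical, and treating both prompt and completion as required is an assumption based on the example above.

```python
import json


def missing_fields(jsonl_text, required=("prompt", "completion")):
    """Return (line_number, missing_keys) for each record lacking a required key."""
    report = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        record = json.loads(line)
        missing = [key for key in required if key not in record]
        if missing:
            report.append((number, missing))
    return report


sample = (
    '{"prompt": "Capital of Australia", "completion": "Canberra"}\n'
    '{"prompt": "Distance between Rome and Kuala Lumpur"}\n'
)
print(missing_fields(sample))  # prints [(2, ['completion'])]
```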

Example of an instruction dataset: Andzej-75/German_RisingWorld_prompt-text-rejected_Jsonl.

Text data

If you have unstructured data, you can put each piece in the text field of a separate line:

{"text": "assistant: How can I help you?\n user: I need my current balance statement.\n assistant: Please find your statement in the attachment."}
{"text": "user: Can I open a bank account?\n assistant: Sure! Please fill out the following form."}
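Unstructured text often contains line breaks, which must not split a record across lines in the file. The following sketch relies on Python's json module to escape them as \n, as in the examples above; the helper name text_record is hypothetical.

```python
import json


def text_record(piece):
    """Wrap one piece of unstructured text in a {"text": ...} JSON Lines record."""
    # json.dumps escapes real line breaks as \n, so the record stays on one line.
    return json.dumps({"text": piece}, ensure_ascii=False)


transcript = (
    "user: Can I open a bank account?\n"
    "assistant: Sure! Please fill out the following form."
)

line = text_record(transcript)
print(line)
```

Parsing the line again with json.loads restores the original text, including its line break.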

Example of a text dataset: FranciscoMacaya/train_model.jsonl.
