You can create datasets to train a model and to validate the training results. Nebius Token Factory supports the JSON Lines format (.jsonl) for dataset files. You can use one of the following dataset types: conversational data, instruction data, or text data.
The size limit for dataset files is 5 gigabytes.
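A quick local check can catch an oversized file before upload. The following is a minimal sketch, not part of the Nebius Token Factory tooling: the helper name within_size_limit is hypothetical, and it assumes that "5 gigabytes" means 5 * 10**9 bytes, which the text above does not specify.

```python
import os

# Assumption: "5 gigabytes" is read as 5 * 10**9 bytes. If the service counts
# in binary units (5 * 2**30 bytes), this check is slightly stricter than needed.
SIZE_LIMIT_BYTES = 5 * 10**9


def within_size_limit(path, limit_bytes=SIZE_LIMIT_BYTES):
    """Return True if the file at `path` is no larger than `limit_bytes`."""
    return os.path.getsize(path) <= limit_bytes
```

Calling within_size_limit("train.jsonl") on a local file (the file name here is only an illustration) returns False when the file would exceed the assumed limit.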
Conversational data
You can train a model on chats. Put each chat on its own line of the file, as a JSON object with a single messages field. The following example shows one such object, expanded across several lines for readability:
{
"messages": [
{"role": "system", "content": "This is a system prompt."},
{"role": "user", "content": "What's for dinner tonight?"},
{"role": "assistant", "content": "What cuisine do you prefer?"},
{"role": "user", "content": "I wouldn't mind Italian."},
{"role": "assistant", "content": "Then let's try a new pasta!"}
]
}
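Because JSON Lines keeps each record on one line, the chat above has to be serialized without line breaks when it is written to the dataset file. The following sketch shows one way to do that with the Python standard library; the function name chat_to_jsonl_line is hypothetical and not part of any Nebius API.

```python
import json


def chat_to_jsonl_line(messages):
    """Serialize one chat as a single-line JSON object with a "messages" field."""
    # json.dumps emits no raw line breaks unless `indent` is set,
    # so the result is safe to write as one line of a .jsonl file.
    return json.dumps({"messages": messages}, ensure_ascii=False)


chat = [
    {"role": "system", "content": "This is a system prompt."},
    {"role": "user", "content": "What's for dinner tonight?"},
    {"role": "assistant", "content": "What cuisine do you prefer?"},
    {"role": "user", "content": "I wouldn't mind Italian."},
    {"role": "assistant", "content": "Then let's try a new pasta!"},
]

line = chat_to_jsonl_line(chat)
```

To build a dataset file, write one such line per chat, each followed by a newline character.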
Examples of conversational datasets:
Instruction data
You can specify prompts and the expected answers to them, one prompt and completion pair per line:
{"prompt": "Capital of Australia", "completion": "Canberra"}
{"prompt": "Distance between Rome and Kuala Lumpur", "completion": "9703 km"}
Example of an instruction dataset: Andzej-75/German_RisingWorld_prompt-text-rejected_Jsonl.
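Instruction records are easy to check before upload. The sketch below parses each line and reports records where prompt or completion is missing or is not a string. It is an illustration only: the function name validate_instruction_lines is hypothetical, and skipping blank lines is an assumption, since the text above does not say how the service treats them.

```python
import json


def validate_instruction_lines(lines):
    """Return a list of (line_number, problem) tuples for invalid records."""
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: blank lines are ignored rather than reported.
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((number, "record is not a JSON object"))
            continue
        for key in ("prompt", "completion"):
            if not isinstance(record.get(key), str):
                errors.append((number, f"missing or non-string '{key}'"))
    return errors


sample = [
    '{"prompt": "Capital of Australia", "completion": "Canberra"}',
    '{"prompt": "Distance between Rome and Kuala Lumpur"}',
]
errors = validate_instruction_lines(sample)
# errors == [(2, "missing or non-string 'completion'")]
```

The same function accepts an open file object, because iterating over a text file yields its lines.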
Text data
If you have unstructured data, you can put each piece on its own line, in a text field:
{"text": "assistant: How can I help you?\n user: I need my current balance statement.\n assistant: Please find your statement in the attachment."}
{"text": "user: Can I open a bank account?\n assistant: Sure! Please fill out the following form."}
Example of a text dataset: FranciscoMacaya/train_model.jsonl.
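Unstructured pieces often contain line breaks, which must not appear raw inside a JSON Lines record. The following sketch, with the hypothetical helper text_record, relies on json.dumps to escape them as \n so that each piece stays on one line, as in the text records above.

```python
import json


def text_record(piece):
    """Wrap one piece of unstructured text in a single-line {"text": ...} record."""
    # Line breaks inside `piece` are escaped as \n by json.dumps.
    return json.dumps({"text": piece}, ensure_ascii=False)


piece = (
    "user: Can I open a bank account?\n"
    " assistant: Sure! Please fill out the following form."
)
line = text_record(piece)
```

Parsing the line back with json.loads restores the original text, including its line break.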