You can fine-tune a generic model to adapt it to domain-specific tasks. Fine-tuning in Nebius Token Factory can also save costs: training on a dataset is cheaper than repeatedly sending numerous example prompts.
Choose one of the models supported for fine-tuning.
Create a dataset for training. Optionally, create an additional dataset for validation. Split the data between the two datasets: 80–90% for training and 10–20% for validation. Requirements for validation datasets are the same as for training datasets.
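As an illustration, the split can be scripted. The chat-message record shape below is only a sketch; follow the dataset requirements for the exact schema:

```python
import json
import random

# Hypothetical records; check the dataset requirements for the exact schema
samples = [
    {"messages": [{"role": "user", "content": f"Question {i}"},
                  {"role": "assistant", "content": f"Answer {i}"}]}
    for i in range(100)
]

random.seed(0)
random.shuffle(samples)
split = int(len(samples) * 0.9)  # 90% for training, 10% for validation

with open("training.jsonl", "w") as f:
    for record in samples[:split]:
        f.write(json.dumps(record) + "\n")

with open("validation.jsonl", "w") as f:
    for record in samples[split:]:
        f.write(json.dumps(record) + "\n")
```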
Upload a training and a validation dataset. The validation dataset is optional.
```python
# Upload a training dataset
training_dataset = client.files.create(
    file=open("<dataset_name>.jsonl", "rb"),  # Specify the dataset name
    purpose="fine-tune"
)

# Upload a validation dataset
validation_dataset = client.files.create(
    file=open("<dataset_name>.jsonl", "rb"),  # Specify the dataset name
    purpose="fine-tune"
)
```
Configure the fine-tuning parameters. For more information about the tuning job parameters, see the specification of the fine-tuning job object.
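For example, a job request might look like the following sketch. The field names follow the OpenAI-compatible convention, and all placeholders and values are illustrative, not defaults; check the fine-tuning job object specification for the exact schema:

```python
# Illustrative sketch: placeholders and values are examples only
job_request = {
    "model": "<model_name>",
    "training_file": "<training_file_id>",      # ID returned when uploading the dataset
    "validation_file": "<validation_file_id>",  # optional
    "hyperparameters": {
        "n_epochs": 3,
        "lora": True,
    },
    "suffix": "my-experiment",
}
```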
```python
# Create and run the fine-tuning job
job = client.fine_tuning.jobs.create(**job_request)
```
Check the job status.
```python
import time

# Check the job status
active_statuses = ["validating_files", "queued", "running"]
while job.status in active_statuses:
    time.sleep(15)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print("current status is", job.status)
print("Job ID:", job.id)
```
The status of a freshly started job is running. The script polls the status periodically until it changes to succeeded. The minimum time window between subsequent polls is 15 seconds.

If the status is failed, examine the output: it describes the error and how to fix it. If the error code is 500, resubmit the job.

Check that the training has been successful. To do this, check the job events; they are created when the job status changes. You can consider the training finished if the response contains either the Dataset processed successfully or Training completed successfully message.
Retrieve the contents of the files with the fine-tuned model.
```python
import os

if job.status == "succeeded":
    # Check the job events
    events = client.fine_tuning.jobs.list_events(job.id)
    print(events)

    for checkpoint in client.fine_tuning.jobs.checkpoints.list(job.id).data:
        print("Checkpoint ID:", checkpoint.id)
        # Create a directory for every checkpoint
        os.makedirs(checkpoint.id, exist_ok=True)
        for model_file_id in checkpoint.result_files:
            # Get the name of a model file
            filename = client.files.retrieve(model_file_id).filename
            # Retrieve the contents of the file
            file_content = client.files.content(model_file_id)
            # Save the contents into a file inside the checkpoint directory
            file_content.write_to_file(os.path.join(checkpoint.id, filename))
```
You get the files for every fine-tuning checkpoint. A checkpoint is created after every epoch of training a model, so you get intermediate results of the training. If you need final results, use the files from the last checkpoint.

Save the contents to files. The script creates a directory per checkpoint and saves the files into these directories.
Save the file ID; it is required to create a fine-tuning job.
Optionally, to upload a validation dataset, use the same request as for the training dataset and save the file ID from the response.
Create a fine-tuning job by using the Nebius Token Factory API:
For more information about the fine-tuning job parameters, see the API specification of the fine-tuning job object below.
Make sure that the job status is succeeded. To do this, request information about this job:
Specify the job ID in the endpoint.

The status of a freshly started job is running. Poll the status periodically to make sure that it has changed to succeeded. Do not send requests more often than once every 15 seconds.

If the status is failed, examine the output: it describes the error and how to fix it. If the error code is 500, resubmit the job.
limit (integer, optional): Number of events to return.
after (string, optional): Pagination ID. Points to the event from which the response should continue.
You can consider the training as finished if the response contains either the Dataset processed successfully or Training completed successfully message.
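As a sketch, this check can be expressed as a small helper. The success messages come from above; the function and the plain-string event shape are assumptions for illustration:

```python
SUCCESS_MESSAGES = (
    "Dataset processed successfully",
    "Training completed successfully",
)

def training_finished(event_messages):
    """Return True if any event message signals successful training."""
    return any(
        success in message
        for message in event_messages
        for success in SUCCESS_MESSAGES
    )

print(training_finished(["Job queued", "Training completed successfully"]))  # True
```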
To get the files with the fine-tuned model, first get a list of checkpoints. This list contains the IDs of the required files.

A checkpoint is created after every epoch of training a model, so you can get intermediate results of the training. If you need final results, use the files from the last checkpoint.

To get the checkpoint list, send the following request:
Use the filename field to save the files properly. For example, if file A has adapter_config.json as its filename, save that file's contents as the adapter_config.json file.
Retrieve the contents of the files with the fine-tuned model:
Send this request for every file of the checkpoint that you need.
Copy the file contents from the response, and then save the files by using a proper name and extension (see the filename field in the output of the previous request).
Now, you can use these files to host the fine-tuned model and work with it.
suffix (string, optional): Suffix added to the model name (for example, my-model or my-experiment). It helps you differentiate between fine-tuned models in their list.
training_file (string, required): ID of the file with the training dataset. For more information about how to prepare and upload datasets and how to get their IDs, see the following instructions:
batch_size (integer, optional): Number of training examples used in a batch for fine-tuning. A bigger batch size works better with bigger datasets. From 8 to 32. Default: 8.
learning_rate (float, optional): Learning rate for training. If you train a model in a domain in which the model has not been trained before, you may need a higher learning rate. Greater than or equal to 0. Default: 0.00001.
n_epochs (integer, optional): Number of epochs to train on the dataset. An epoch is a cycle of going through the whole dataset for training. For example, if the number of epochs is 10, the model is trained on a given dataset 10 times. From 1 to 20. Default: 3.
warmup_ratio (float, optional): Fraction of training during which the learning rate ramps up from the beginning of training. From 0 to 1. Default: 0.
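Conceptually, a linear warmup ramps the learning rate from zero to its peak over the first warmup_ratio share of training steps. The sketch below illustrates the idea; it is not the service's exact schedule:

```python
def lr_at_step(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup: ramp the learning rate from 0 to peak_lr
    over the first warmup_ratio share of training steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

With warmup_ratio=0.1 and 100 total steps, the learning rate reaches its peak at step 10 and stays there for the rest of training.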
weight_decay (float, optional): Weight decay value. Weight decay is a regularization technique that adds a penalty to the loss function and keeps fine-tuning weights small. This approach prevents overfitting and preserves generalization, so it is better suited for larger models or more complex tasks. Greater than or equal to 0. Default: 0.
lora (boolean, optional): Whether to enable LoRA (Low-Rank Adaptation) for training. The LoRA method inserts low-rank matrices into a pre-trained model; these matrices capture task-specific data during training. As a result, you only train these matrices; you do not need to retrain the whole model or modify any preset fine-tuning parameters. If false, full fine-tuning is performed. Default: false.
lora_r (integer, optional): Rank for weights of LoRA adapters. A larger rank captures more of the pre-existing model weights for training. Eventually, the model is trained better, especially if it is trained for a task for which it has not been trained before. However, a rank that is too high can cause overfitting. From 8 to 128. Default: 8.
lora_alpha (integer, optional): Alpha value for training LoRA adapters. This parameter balances the influence of low-rank LoRA matrices on pre-existing model weights. If only a slight adjustment of a model is required, use a lower value. Greater than or equal to 8. Default: 8.
lora_dropout (float, optional): LoRA dropout rate. LoRA dropout is a regularization technique that randomly omits a fraction of the model’s LoRA parameters during training. As a result, this technique helps avoid overfitting on the dataset, especially in cases when the dataset is small and the model should suit more general tasks. From 0 to 1. Default: 0.
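To illustrate the idea behind these LoRA parameters (this is a conceptual sketch, not the service's implementation), a LoRA update adds a low-rank correction scaled by lora_alpha / lora_r to the frozen weights; the dimensions and values below are made up:

```python
d, lora_r, lora_alpha = 3, 2, 8  # illustrative dimensions and alpha

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]   # frozen pre-trained weights (d x d)
A = [[0.1, 0.0, 0.0],
     [0.0, 0.1, 0.0]]   # trained low-rank matrix (lora_r x d)
B = [[0.1, 0.0],
     [0.0, 0.1],
     [0.0, 0.0]]        # trained low-rank matrix (d x lora_r)

# Only A and B are trained; the effective weights are W + (alpha / r) * B @ A
scale = lora_alpha / lora_r
W_adapted = [
    [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(lora_r))
     for j in range(d)]
    for i in range(d)
]
```

Because A and B together have far fewer entries than W, only a small fraction of parameters is trained.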
packing (boolean, optional): Whether to use packing for training. With packing enabled, you can combine multiple small samples in a batch instead of having one sample per batch. This increases training efficiency. Default: true.
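A minimal sketch of the packing idea (greedy, by sample length; the service's actual implementation is not specified here):

```python
def pack_samples(sample_lengths, max_tokens):
    """Greedily pack consecutive samples into batches of at most max_tokens."""
    batches, current, used = [], [], 0
    for length in sample_lengths:
        if current and used + length > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        batches.append(current)
    return batches

print(pack_samples([100, 200, 300, 150, 250], 512))  # [[100, 200], [300, 150], [250]]
```

Instead of five batches with one short sample each, the samples fit into three well-filled batches.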
max_grad_norm (float, optional): Maximum gradient norm value used for gradient clipping. Make sure that the value is neither too large nor too small:
A value that is too small causes the vanishing gradient problem: aggressive clipping makes weight gradients too small during backpropagation. As a result, the model cannot learn quickly enough.
A value that is too large fails to prevent the exploding gradient problem, which is the opposite of the vanishing gradient problem. Explosion happens when weight gradients get large. As a result, it leads to unstable, poorly optimized training.
Greater than or equal to 0. Default: 1.
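Conceptually, gradient clipping rescales the gradient whenever its norm exceeds max_grad_norm; a minimal sketch:

```python
import math

def clip_gradient(grad, max_grad_norm):
    """Scale the gradient down if its L2 norm exceeds max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_grad_norm:
        scale = max_grad_norm / norm
        return [g * scale for g in grad]
    return grad
```

For example, clip_gradient([3.0, 4.0], 1.0) rescales a gradient of norm 5 down to norm 1, while gradients below the threshold pass through unchanged.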
seed (integer, optional): Controls reproducibility of the output. If you pass the same seed in different requests, you get approximately the same results. If you use the same seed but different values of other parameters, the results of your requests might differ.
integrations (array, optional): Integrations that Nebius Token Factory supports for fine-tuning:
type (string, optional): Integration type. The possible values are the following:
You can export the model training metrics to a project in Weights & Biases. Nebius Token Factory exports the metrics after you create a fine-tuning job. The service does not export system metrics or logs.
wandb (object, optional): Settings for the export to a project in Weights & Biases:
api_key (string, optional): API key from Weights & Biases. The key should be 40 characters long.
project (string, optional): Name of the project in Weights & Biases.
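Putting the fields above together, an integrations entry for Weights & Biases export might look like the following sketch (the API key and project name are placeholders; check the API specification for the exact schema):

```python
# Placeholders only; values are illustrative
integrations = [
    {
        "type": "wandb",
        "wandb": {
            "api_key": "<your_40_character_wandb_api_key>",
            "project": "<project_name>",
        },
    }
]
```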