Requests sent in batches cost 50% less than regular requests to base models. The price of batch inference does not depend on the model flavor that you use. Batch inference does not consume tokens from per-model rate limits. All batch requests are processed asynchronously, with most completed within 24 hours.Documentation Index
Fetch the complete documentation index at: https://docs.tokenfactory.nebius.com/llms.txt
Use this file to discover all available pages before exploring further.
Prepare a batch file
Prepare a file with JSON lines (JSONL), with each line representing a request to a single model through the API. We will usebatch-requests.jsonl as an example.
The file contents format:
custom_id: Unique ID to refer to the inference results.url: API endpoint. Available endpoints are/v1/chat/completionsand/v1/embeddings.body.model: Model ID. The ID should be the same across the file.
- Up to 5,000,000 requests.
- Up to 10 GB in size.
Upload the file
Before uploading the file, check that your API key is saved to theNEBIUS_API_KEY environment variable.
Create a batch
You can create up to 500 batches.endpoint: Endpoint matching the one from your JSONL file.completion_window: Time period within the batch will be processed. We support only24hcompletion window.
Get the batch status
A batch can be completed sooner than in 24 hours. To check the completion status, refer to the batch API.Get the results
When the batch status changes tocompleted, copy the output_file_id from the response and download the file with the results from the files API.
custom_id of the corresponding lines.
To view unsuccessful requests, get your batch status, copy the error_file_id from the response and download the file with the failed requests lines.
Cancel a batch
To cancel the outgoing batch, use the batch API.cancelling status and the number of completed and failed requests.