Batch Inference enables asynchronous processing of large datasets without requiring real-time responses. It is designed for workloads where throughput and cost efficiency are more important than latency. For more information, see the Batch Inference documentation.

When to use batch inference

Batch inference is ideal when:
  • You need to process thousands or millions of inputs
  • Results are not needed immediately
  • You want predictable cost and execution (jobs complete within a defined completion window)
  • You want to reuse outputs downstream (evaluation, training, analysis)

How batch inference works

  1. Input dataset: you select an existing dataset from Data Lab or upload a new one.
  2. Job configuration: you choose the model and inference parameters.
  3. Execution: the job is queued and processed asynchronously when capacity is available.
  4. Results storage: outputs are written back into Data Lab as a new dataset.
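
The full example at the end of this page walks through these steps inline. As a quick orientation, the sketch below expresses the same four calls as small helper functions, using the endpoints from that example (a minimal sketch, not a complete client):

import os

import requests

BASE_URL = "https://api.tokenfactory.nebius.com"
HEADERS = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}


def upload_dataset(payload):
    # step 1: register the input rows as a new Data Lab dataset
    response = requests.post(f"{BASE_URL}/v1/datasets", json=payload, headers=HEADERS)
    response.raise_for_status()
    return response.json()


def start_batch_inference(payload):
    # step 2: configure the model and parameters; the operation is queued asynchronously
    response = requests.post(f"{BASE_URL}/v1/operations", json=payload, headers=HEADERS)
    response.raise_for_status()
    return response.json()


def get_operation(operation_id):
    # step 3: execution happens when capacity is available; poll the operation status
    response = requests.get(f"{BASE_URL}/v1/operations/{operation_id}", headers=HEADERS)
    response.raise_for_status()
    return response.json()


def export_results(dataset_id):
    # step 4: results are stored as a new dataset, which can be exported
    response = requests.get(
        f"{BASE_URL}/v1/datasets/{dataset_id}/export?format=jsonl", headers=HEADERS
    )
    response.raise_for_status()
    return response.text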

Batch inference outputs

Each output record typically contains:
  • Prompt
  • Completion (model-generated response)
  • Completed dialogue
  • Raw prompt and response
  • Inference parameters used for the request
  • Token usage
  • Execution status (success or error)
These outputs can be:
  • Inspected inside Data Lab
  • Filtered or transformed
  • Exported
  • Used as input for additional batch jobs or fine-tuning
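
For orientation, one exported record might look roughly like the Python dict below. The field names and values here are illustrative assumptions, not the exact export schema; inspect the exported dataset in Data Lab to see the actual column names.

# illustrative shape of a single output record; field names and values are
# assumptions for orientation only, not the exact export schema
example_output_record = {
    "custom_id": "1",
    "prompt": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of the Netherlands?"},
    ],
    "completion": "The capital of the Netherlands is Amsterdam.",
    "parameters": {"model": "openai/gpt-oss-20b", "max_tokens": 32000},
    "usage": {"prompt_tokens": 24, "completion_tokens": 11, "total_tokens": 35},
    "status": "success",
}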

Common use cases

  • Offline prompt evaluation
  • Content generation at scale
  • Dataset labeling and augmentation

Batch inference example

This example shows how to upload a dataset, run asynchronous batch inference by mapping dataset columns to model inputs, track the job status, and export the generated results once processing is complete.
import os
import time

import requests


# initialize all necessary variables
base_url = "https://api.tokenfactory.nebius.com"
token = os.environ["NEBIUS_API_KEY"]
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {token}",
}
records = [
    {
        # message content can be given as a plain string...
        "prompt": [
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {
                "role": "user",
                "content": "What is the capital of the Netherlands?",
            },
        ],
        "custom_id": "1",
    },
    {
        # ...or as a list of typed content blocks
        "prompt": [
            {
                "role": "system",
                "content": [
                    {
                        "text": "You are a helpful assistant.",
                        "type": "text",
                    },
                ],
            },
            {
                "role": "user",
                "content": "What should I do if it’s raining and I forgot my umbrella?",
            },
        ],
        "custom_id": "3",
    },
]

# upload dataset
# you can upload datasets with arbitrary column names and use them for inference or fine-tuning
response = requests.post(
    f"{base_url}/v1/datasets",
    json={
        "name": "Example Batch Inference Dataset",
        "dataset_schema": [
            {
                "name": "prompt",
                "type": {
                    "name": "json"
                },
            },
            {
                "name": "custom_id",
                "type": {
                    "name": "string",
                },
            },
        ],
        "folder": "/demo",
        "rows": records,
    },
    headers=headers,
)
response.raise_for_status()

print("Dataset uploaded:")
print(response.json())
source_dataset_id = response.json()["id"]
source_dataset_version_id = response.json()["current_version"]

# run batch inference
# the mapping lets you feed arbitrary columns from the dataset uploaded above into the model
response = requests.post(
    f"{base_url}/v1/operations",
    json={
        "type": "batch_inference",
        "src": [
            {
                "id": source_dataset_id,
                "version": source_dataset_version_id,
                "mapping": {
                    "type": "text_messages",
                    "messages": {
                        "type": "column",
                        "name": "prompt",  # you can use any column that contains JSON in the appropriate format
                    },
                    "custom_id": {  # optional
                        "type": "column",
                        "name": "custom_id",
                    },
                    "max_tokens": {  # optional
                        "type": "text",
                        "value": "32000",
                    }
                },
            },
        ],
        "dst": [],
        "params": {
            "model": "openai/gpt-oss-20b",
            "completion_window": "12h",
        },
    },
    headers=headers,
)
response.raise_for_status()
print("Batch inference started:")
print(response.json())
dst_dataset_id = response.json()["dst"][0]["id"]
operation_id = response.json()["id"]

# wait for operation to complete
while True:
    status_response = requests.get(
        f"{base_url}/v1/operations/{operation_id}",
        headers=headers,
    )
    status_response.raise_for_status()
    status = status_response.json()["status"]
    print(f"Operation status: {status}")
    if status not in {"queued", "running"}:
        break
    time.sleep(5)

# download results
response = requests.get(
    f"{base_url}/v1/datasets/{dst_dataset_id}/export?format=jsonl",  # csv is also supported, use limit and offset for big datasets
    headers=headers,
)
response.raise_for_status()
print("Batch inference results:")
print(response.text)
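
If you need to post-process the export, the JSONL payload can be parsed line by line. A minimal sketch continuing the example above; it assumes the export keeps the custom_id column, and that limit and offset behave as standard pagination parameters (as the comment on the export request suggests):

import json

# each line of the JSONL export is one output record
results = [json.loads(line) for line in response.text.splitlines() if line.strip()]
print(f"Parsed {len(results)} records")

# assumption: the export retains the custom_id column, so results can be
# matched back to the original inputs
by_custom_id = {row.get("custom_id"): row for row in results}

# for big datasets, page through the export instead of downloading it at once
# (limit/offset semantics are an assumption based on the comment above)
page_response = requests.get(
    f"{base_url}/v1/datasets/{dst_dataset_id}/export?format=jsonl&limit=1000&offset=0",
    headers=headers,
)
page_response.raise_for_status()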