Batch Inference enables asynchronous processing of large datasets without requiring real-time responses. It is designed for workloads where throughput and cost efficiency are more important than latency. For more information, see the Batch Inference documentation.

When to use batch inference

Batch inference is ideal when:
  • You need to process thousands or millions of inputs
  • Results are not needed immediately
  • You want predictable cost and execution (jobs complete within a defined completion window)
  • You want to reuse outputs downstream (evaluation, training, analysis)

How batch inference works

  1. Input dataset: you select an existing dataset from Data Lab or upload a new one.
  2. Job configuration: you choose the model and inference parameters.
  3. Execution: the job is queued and processed asynchronously when capacity is available.
  4. Results storage: outputs are written back into Data Lab as a new dataset.
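
The full example at the end of this page walks through these steps inline. As a quick orientation, the sketch below expresses the same four calls as small helper functions, using the endpoints from that example (a minimal sketch, not a complete client):

import os

import requests

BASE_URL = "https://api.tokenfactory.nebius.com"
HEADERS = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}


def upload_dataset(payload):
    # step 1: register the input rows as a new Data Lab dataset
    response = requests.post(f"{BASE_URL}/v1/datasets", json=payload, headers=HEADERS)
    response.raise_for_status()
    return response.json()


def start_batch_inference(payload):
    # step 2: configure the model and parameters; the operation is queued asynchronously
    response = requests.post(f"{BASE_URL}/v1/operations", json=payload, headers=HEADERS)
    response.raise_for_status()
    return response.json()


def get_operation(operation_id):
    # step 3: execution happens when capacity is available; poll the operation status
    response = requests.get(f"{BASE_URL}/v1/operations/{operation_id}", headers=HEADERS)
    response.raise_for_status()
    return response.json()


def export_results(dataset_id):
    # step 4: results are stored as a new dataset, which can be exported
    response = requests.get(
        f"{BASE_URL}/v1/datasets/{dataset_id}/export?format=jsonl", headers=HEADERS
    )
    response.raise_for_status()
    return response.text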

Batch inference outputs

Each output record typically contains:
  • Prompt
  • Completion (model-generated response)
  • Completed dialogue
  • Raw prompt and response
  • Inference parameters used for the request
  • Token usage
  • Execution status (success or error)
These outputs can be:
  • Inspected inside Data Lab
  • Filtered or transformed
  • Exported
  • Used as input for additional batch jobs or fine-tuning
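
For orientation, one exported record might look roughly like the Python dict below. The field names and values here are illustrative assumptions, not the exact export schema; inspect the exported dataset in Data Lab to see the actual column names.

# illustrative shape of a single output record; field names and values are
# assumptions for orientation only, not the exact export schema
example_output_record = {
    "custom_id": "1",
    "prompt": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of the Netherlands?"},
    ],
    "completion": "The capital of the Netherlands is Amsterdam.",
    "parameters": {"model": "openai/gpt-oss-20b", "max_tokens": 32000},
    "usage": {"prompt_tokens": 24, "completion_tokens": 11, "total_tokens": 35},
    "status": "success",
}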

Common use cases

  • Offline prompt evaluation
  • Content generation at scale
  • Dataset labeling and augmentation

Batch inference example

This example shows how to upload a dataset, run asynchronous batch inference by mapping dataset columns to model inputs, track the job status, and export the generated results once processing is complete.
import os
import time

import requests


# initialize all necessary variables
base_url = "https://api.tokenfactory.nebius.com"
token = os.environ["NEBIUS_API_KEY"]
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {token}",
}
records = [
    {
        # message content can be given as a plain string...
        "prompt": [
            {
                "role": "system",
                "content": "You are a helpful assistant.",
            },
            {
                "role": "user",
                "content": "What is the capital of the Netherlands?",
            },
        ],
        "custom_id": "1",
    },
    {
        # ...or as a list of typed content blocks
        "prompt": [
            {
                "role": "system",
                "content": [
                    {
                        "text": "You are a helpful assistant.",
                        "type": "text",
                    },
                ],
            },
            {
                "role": "user",
                "content": "What should I do if it’s raining and I forgot my umbrella?",
            },
        ],
        "custom_id": "3",
    },
]

# upload dataset
# you can upload datasets with arbitrary column names and use them for inference or fine-tuning
response = requests.post(
    f"{base_url}/v1/datasets",
    json={
        "name": "Example Batch Inference Dataset",
        "dataset_schema": [
            {
                "name": "prompt",
                "type": {
                    "name": "json"
                },
            },
            {
                "name": "custom_id",
                "type": {
                    "name": "string",
                },
            },
        ],
        "folder": "/demo",
        "rows": records,
    },
    headers=headers,
)
response.raise_for_status()

print("Dataset uploaded:")
print(response.json())
source_dataset_id = response.json()["id"]
source_dataset_version_id = response.json()["current_version"]

# run batch inference
# the mapping lets you feed arbitrary columns from the dataset uploaded above into the model
response = requests.post(
    f"{base_url}/v1/operations",
    json={
        "type": "batch_inference",
        "src": [
            {
                "id": source_dataset_id,
                "version": source_dataset_version_id,
                "mapping": {
                    "type": "text_messages",
                    "messages": {
                        "type": "column",
                        "name": "prompt",  # you can use any column that contains JSON in the appropriate format
                    },
                    "custom_id": {  # optional
                        "type": "column",
                        "name": "custom_id",
                    },
                    "max_tokens": {  # optional
                        "type": "text",
                        "value": "32000",
                    }
                },
            },
        ],
        "dst": [],
        "params": {
            "model": "openai/gpt-oss-20b",
            "completion_window": "12h",
        },
    },
    headers=headers,
)
response.raise_for_status()
print("Batch inference started:")
print(response.json())
dst_dataset_id = response.json()["dst"][0]["id"]
operation_id = response.json()["id"]

# wait for operation to complete
while True:
    status_response = requests.get(
        f"{base_url}/v1/operations/{operation_id}",
        headers=headers,
    )
    status_response.raise_for_status()
    status = status_response.json()["status"]
    print(f"Operation status: {status}")
    if status not in {"queued", "running"}:
        break
    time.sleep(5)

# download results
response = requests.get(
    f"{base_url}/v1/datasets/{dst_dataset_id}/export?format=jsonl",  # csv is also supported, use limit and offset for big datasets
    headers=headers,
)
response.raise_for_status()
print("Batch inference results:")
print(response.text)
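
If you need to post-process the export, the JSONL payload can be parsed line by line. A minimal sketch continuing the example above; it assumes the export keeps the custom_id column, and that limit and offset behave as standard pagination parameters (as the comment on the export request suggests):

import json

# each line of the JSONL export is one output record
results = [json.loads(line) for line in response.text.splitlines() if line.strip()]
print(f"Parsed {len(results)} records")

# assumption: the export retains the custom_id column, so results can be
# matched back to the original inputs
by_custom_id = {row.get("custom_id"): row for row in results}

# for big datasets, page through the export instead of downloading it at once
# (limit/offset semantics are an assumption based on the comment above)
page_response = requests.get(
    f"{base_url}/v1/datasets/{dst_dataset_id}/export?format=jsonl&limit=1000&offset=0",
    headers=headers,
)
page_response.raise_for_status()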