Batch Inference enables asynchronous processing of large datasets without requiring real-time responses. It is designed for workloads where throughput and cost efficiency are more important than latency.
When to use batch inference
Batch inference is ideal when:
You need to process thousands or millions of inputs
Results are not needed immediately
You want predictable costs and execution times
You want to reuse outputs downstream (evaluation, training, analysis)
How batch inference works
Input dataset: you select an existing dataset from Data Lab or upload a new one.
Job configuration: you choose the model and inference parameters.
Execution: the job is queued and processed asynchronously when capacity is available.
Results storage: outputs are written back into Data Lab as a new dataset. A sketch of the full lifecycle follows these steps.
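Put together, the lifecycle is: pick a dataset, configure the job, submit it, and collect the results from Data Lab once the queue has processed it. Below is a minimal sketch of that flow in Python, assuming a hypothetical REST API; the base URL, endpoint paths, field names, and dataset/model identifiers are illustrative, not the platform's actual interface.

```python
import time
import requests

BASE_URL = "https://api.example.com/v1"          # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Job configuration: reference an existing Data Lab dataset by ID
# and choose the model and inference parameters.
job_spec = {
    "input_dataset_id": "ds_prompts_2024",        # illustrative dataset ID
    "model": "my-model",                          # illustrative model name
    "parameters": {"temperature": 0.2, "max_tokens": 512},
}

# Execution: submit the job; it is queued and runs asynchronously.
resp = requests.post(f"{BASE_URL}/batch-jobs", json=job_spec, headers=HEADERS)
resp.raise_for_status()
job_id = resp.json()["id"]

# Poll until the job finishes; no real-time response is expected.
while True:
    status = requests.get(f"{BASE_URL}/batch-jobs/{job_id}", headers=HEADERS).json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(60)

# Results storage: outputs land in Data Lab as a new dataset,
# referenced here by an ID returned on completion.
print("Output dataset:", status.get("output_dataset_id"))
```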
Batch inference outputs
Each output record typically contains the following fields (an illustrative record is sketched after this list):
Prompt
Completion (model-generated response)
Completed dialogue
Raw prompt and response
Inference parameters used for the request
Token usage
Execution status (success or error)
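To make these fields concrete, here is an illustrative record as it might appear in the output dataset; the field names and nesting are assumptions rather than the platform's exact schema.

```python
# One illustrative output record (field names are assumptions).
record = {
    "prompt": "Summarize the following support ticket: ...",
    "completion": "The customer reports a billing discrepancy ...",
    "dialogue": [
        {"role": "user", "content": "Summarize the following support ticket: ..."},
        {"role": "assistant", "content": "The customer reports a billing discrepancy ..."},
    ],
    "raw": {
        "prompt": "<rendered prompt sent to the model>",
        "response": "<unprocessed model response>",
    },
    "parameters": {"model": "my-model", "temperature": 0.2, "max_tokens": 512},
    "usage": {"prompt_tokens": 184, "completion_tokens": 96},
    "status": "success",  # or "error"
}
```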
These outputs can be:
Inspected inside Data Lab
Filtered or transformed
Exported
Used as input for additional batch jobs or fine-tuning (see the filtering sketch after this list)
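As a sketch of the last point, the snippet below filters a batch output file down to successful rows and writes prompt/completion pairs that could seed a fine-tuning dataset. The JSONL layout and field names follow the illustrative record above and are assumptions.

```python
import json

# Read batch outputs (one JSON record per line), keep successful rows,
# and write prompt/completion pairs for a fine-tuning dataset.
with open("batch_output.jsonl") as src, open("finetune_pairs.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record.get("status") != "success":
            continue  # skip errored rows
        pair = {"prompt": record["prompt"], "completion": record["completion"]}
        dst.write(json.dumps(pair) + "\n")
```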
Common use cases
Offline prompt evaluation (see the scoring sketch after this list)
Content generation at scale
Dataset labeling and augmentation
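For example, offline prompt evaluation often reduces to scoring batch completions against known references once the job has finished. A minimal sketch follows, assuming the outputs and references are JSONL files keyed by prompt; the file names, fields, and exact-match metric are illustrative choices.

```python
import json

# Load reference answers keyed by prompt (illustrative file and fields).
references = {}
with open("references.jsonl") as f:
    for line in f:
        row = json.loads(line)
        references[row["prompt"]] = row["reference"]

# Score successful batch completions with a simple exact-match metric.
matches, total = 0, 0
with open("batch_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record.get("status") != "success":
            continue
        total += 1
        if record["completion"].strip() == references.get(record["prompt"], "").strip():
            matches += 1

print(f"Exact-match accuracy: {matches / max(total, 1):.2%}")
```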