Batch Inference enables asynchronous processing of large datasets without requiring real-time responses. It is designed for workloads where throughput and cost efficiency are more important than latency.

When to use batch inference

Batch inference is ideal when:
  • You need to process thousands or millions of inputs
  • Results are not needed immediately
  • You want predictable costs and execution times
  • You want to reuse outputs downstream (evaluation, training, analysis)

How batch inference works

  1. Input dataset: you select an existing dataset from Data Lab or upload a new one.
  2. Job configuration: you choose the model and inference parameters.
  3. Execution: the job is queued and processed asynchronously when capacity is available.
  4. Results storage: outputs are written back into Data Lab as a new dataset (a workflow sketch follows this list).
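
As a rough illustration, the sketch below shows how this submit-and-poll workflow might look against a hypothetical REST API. The endpoint paths, request fields, and job states here are assumptions for illustration only, not the platform's actual interface.

```python
import time
import requests

API_BASE = "https://api.example.com/v1"          # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <API_KEY>"}  # placeholder credentials

# Steps 1-2: submit a batch job that references an existing Data Lab dataset
# and specifies the model and inference parameters.
job = requests.post(
    f"{API_BASE}/batch-jobs",
    headers=HEADERS,
    json={
        "input_dataset_id": "dl-input-123",   # dataset selected in Data Lab
        "model": "example-model",
        "parameters": {"temperature": 0.2, "max_tokens": 512},
    },
).json()

# Step 3: the job is queued and processed asynchronously, so poll until it
# reaches a terminal state. Batch jobs favor throughput over latency, so
# infrequent polling is fine.
while True:
    status = requests.get(
        f"{API_BASE}/batch-jobs/{job['id']}", headers=HEADERS
    ).json()
    if status["state"] in ("succeeded", "failed"):
        break
    time.sleep(60)

# Step 4: outputs land back in Data Lab as a new dataset.
print("Results dataset:", status.get("output_dataset_id"))
```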

Batch inference outputs

Each output record typically contains the following fields (an illustrative record follows the list):
  • Prompt
  • Completion (the model-generated response)
  • Completed dialogue
  • Raw prompt and response
  • Inference parameters used for the request
  • Token usage
  • Execution status (success or error)
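
To make the structure concrete, here is a minimal sketch of what a single output record might look like. The field names and values are assumptions for illustration, not the platform's exact schema.

```python
# Illustrative output record; field names are assumptions, not the exact schema.
record = {
    "prompt": "Summarize the following support ticket: ...",
    "completion": "The customer reports that ...",
    "dialogue": [
        {"role": "user", "content": "Summarize the following support ticket: ..."},
        {"role": "assistant", "content": "The customer reports that ..."},
    ],
    "raw_prompt": "<raw prompt string as sent to the model>",
    "raw_response": "<raw response payload returned by the model>",
    "parameters": {"model": "example-model", "temperature": 0.2, "max_tokens": 512},
    "usage": {"prompt_tokens": 184, "completion_tokens": 96},
    "status": "success",  # or "error"
}
```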
These outputs can be:
  • Inspected inside Data Lab
  • Filtered or transformed
  • Exported
  • Used as input for additional batch jobs or fine-tuning (a minimal filtering sketch follows this list)
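
The sketch below shows one way such downstream filtering might look, assuming the results were exported from Data Lab as JSON Lines with the illustrative fields shown above. File names and field names are assumptions.

```python
import json

kept = []
with open("batch_results.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Keep only successful generations; failed records can be re-queued
        # in a follow-up batch job.
        if record.get("status") == "success":
            kept.append({"prompt": record["prompt"], "completion": record["completion"]})

# Write prompt/completion pairs for evaluation or fine-tuning.
with open("finetune_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in kept:
        f.write(json.dumps(pair) + "\n")
```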

Common use cases

  • Offline prompt evaluation
  • Content generation at scale
  • Dataset labeling and augmentation