
Dedicated endpoints provide isolated, configurable deployments of supported models and their performance templates. Use the control plane to create and manage deployments, and the data plane to run inference through OpenAI-compatible APIs. With dedicated endpoints, you control:

Region

Choose where your deployment runs to optimize latency and meet data residency requirements.

GPU configuration

Select GPU type and GPUs per replica to match your performance and throughput needs.

Autoscaling

Set minimum and maximum replicas to automatically scale capacity with traffic.

Lifecycle management

Create, update, stop, and delete deployments as your workloads evolve.
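As a minimal sketch of how these settings fit together, the payload below shows a create-deployment request body for the control plane. The field names, model identifier, GPU type, and region are illustrative assumptions, not the documented API schema:

```python
import json

# Hypothetical control-plane payload for creating a dedicated endpoint.
# Field names, the model ID, GPU type, and region are assumptions for
# illustration, not the documented Nebius Token Factory schema.
deployment = {
    "name": "my-dedicated-endpoint",
    "model": "meta-llama/Llama-3.3-70B-Instruct",  # assumed model identifier
    "region": "eu-north1",                         # fixed, user-selected region
    "gpu": {
        "type": "H100",      # GPU type from the performance template
        "per_replica": 8,    # GPUs per replica
    },
    "autoscaling": {
        "min_replicas": 1,   # capacity floor
        "max_replicas": 4,   # capacity ceiling under traffic
    },
}

# Serialize for an HTTP request (the POST itself is omitted here).
body = json.dumps(deployment)
print(body)
```

Updating, stopping, or deleting a deployment would follow the same pattern with different control-plane calls.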

Key use cases:

  • Predictable, reserved capacity
  • Fine-tuned base models with custom weights
  • Compliance and private-infrastructure requirements
  • Greater control over the deployment

Dedicated vs Public Endpoints Comparison

| Feature | Dedicated Endpoints | Public Serverless Endpoints |
| --- | --- | --- |
| Capacity | Isolated capacity reserved for your organization | Shared multi-tenant capacity |
| Rate limits | No standard rate limits; throughput depends on your deployed capacity | Dynamic rate limits apply |
| Data residency | Deployment region is fixed and user-selected | Region may change based on available capacity |
| Autoscaling | You control minimum and maximum replicas | Platform-managed with predefined limits |
| Custom weights support | Supported for eligible models | Base models only |
| Pricing | Per GPU/hour, billed with per-minute granularity | Per token |
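Once a deployment is running, inference goes through the data plane's OpenAI-compatible API. The sketch below assembles a standard chat-completions request; the base URL, API key, and deployment name are placeholders (assumptions), and the actual POST is omitted:

```python
import json

# OpenAI-compatible chat-completions request against a dedicated endpoint.
# The base URL, key, and deployment name below are placeholders, not real values.
base_url = "https://example.invalid/v1"  # placeholder; use your endpoint's base URL
api_key = "NEBIUS_API_KEY"               # read from your environment in real code

request = {
    "url": f"{base_url}/chat/completions",
    "headers": {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    # Standard OpenAI chat-completions body; "model" targets your deployment.
    "body": {
        "model": "my-dedicated-endpoint",  # placeholder deployment name
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
}

print(json.dumps(request["body"]))
```

Because the shape matches the OpenAI API, existing OpenAI client libraries can typically be pointed at a dedicated endpoint by overriding the base URL and API key.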

Learn more

Deploy via API

Deploy via UI

FAQ