Dedicated endpoints provide isolated, configurable deployments of supported models and their performance templates. Use the control plane to create and manage deployments, and the data plane to run inference through OpenAI-compatible APIs. With dedicated endpoints, you control:
Region
Choose where your deployment runs to optimize latency and meet data residency requirements.
GPU configuration
Select GPU type and GPUs per replica to match your performance and throughput needs.
Autoscaling
Set minimum and maximum replicas to automatically scale capacity with traffic.
Lifecycle management
Create, update, stop, and delete deployments as your workloads evolve.
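Putting those controls together, a deployment created through the control plane could be described by a spec along these lines. This is an illustrative sketch only: the field names, region, and model identifier below are hypothetical placeholders, not the actual API schema.

```json
{
  "name": "my-llama-deployment",
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "region": "eu-north1",
  "gpu_type": "H100",
  "gpus_per_replica": 8,
  "autoscaling": {
    "min_replicas": 1,
    "max_replicas": 4
  }
}
```

With a spec like this, traffic spikes scale the deployment up to `max_replicas` replicas, and quiet periods scale it back down to `min_replicas`, so capacity tracks demand within the bounds you set.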
Key use cases:
- Predictable capacity for steady or latency-sensitive workloads
- Serving fine-tuned base models with custom weights
- Compliance or data-residency requirements that call for private, isolated infrastructure
- Greater control over deployment configuration and lifecycle
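Once a deployment is running, inference goes through the data plane's OpenAI-compatible API, so a request looks like a standard chat-completions call. Below is a minimal sketch using only the Python standard library; the base URL, API key, and model name are placeholders you would replace with your deployment's real values.

```python
import json
import urllib.request

# Placeholder values: substitute your endpoint's actual base URL,
# API key, and deployed model identifier.
BASE_URL = "https://example.dedicated-endpoint.invalid/v1"
API_KEY = "YOUR_API_KEY"
MODEL = "my-deployed-model"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible POST /chat/completions request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("Hello!")
```

Because the API shape is OpenAI-compatible, existing OpenAI client libraries should also work by pointing their base URL at the dedicated endpoint.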
Dedicated vs Public Endpoints Comparison
| Feature | Dedicated Endpoints | Public Serverless Endpoints |
|---|---|---|
| Capacity | Isolated capacity reserved for your organization | Shared multi-tenant capacity |
| Rate limits | No standard rate limits; throughput depends on your deployed capacity | Dynamic rate limits apply |
| Data residency | Deployment region is fixed and user-selected | Region may change based on available capacity |
| Autoscaling | You control minimum and maximum replicas | Platform-managed with predefined limits |
| Custom weights support | Supported for eligible models | Base models only |
| Pricing | Per GPU/hour, billed with per-minute granularity | Per token |
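Because dedicated endpoints are billed per GPU/hour with per-minute granularity, cost scales with replica size and uptime rather than token volume. A quick sketch of that arithmetic; the $3.50/GPU-hour rate here is a made-up placeholder, not an actual price.

```python
def deployment_cost(gpus: int, minutes: int, rate_per_gpu_hour: float) -> float:
    """Cost of running `gpus` GPUs for `minutes` minutes at an hourly
    per-GPU rate, billed with per-minute granularity."""
    return gpus * (minutes / 60) * rate_per_gpu_hour

# Hypothetical example: one 8-GPU replica running for 90 minutes
# at a placeholder rate of $3.50 per GPU-hour:
# 8 GPUs * 1.5 hours * $3.50 = $42.00
cost = deployment_cost(gpus=8, minutes=90, rate_per_gpu_hour=3.50)
```

Per-minute granularity means a deployment stopped after 90 minutes is charged for exactly 1.5 GPU-hours per GPU, not rounded up to a full second hour.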