Overview
Dedicated endpoints let you run an isolated deployment of a supported model with control over:

- Region: determines data residency and latency based on where the deployment runs
- GPU configuration: defines the GPU type and number of GPUs allocated per replica
- Autoscaling: sets minimum and maximum replicas to handle traffic changes automatically
- Lifecycle management: supports creating, updating, and deleting deployments

The service is split into two layers:

- Control plane to manage endpoints
- Data plane to run inference via an OpenAI-compatible inference API
Key Concepts
| Term | Description |
|---|---|
| Template | A deployable performance “blueprint” for a model. Templates define which flavor_name, gpu_type, and regions are supported. |
| Flavor | A template’s sub-option (e.g. base, fast) with different performance, throughput, and cost characteristics. |
| Endpoint | A dedicated deployment with API access. |
| endpoint_id | Identifier used for update/delete operations. |
| routing_key | The model identifier you pass to inference calls. Returned when you create an endpoint. |
| Control plane | Manages the configuration and settings of your endpoints. Uses a common base URL. |
| Data plane | Processes model inference requests. Uses a regional base URL. |
Control plane
The Dedicated Endpoints Control Plane is the management layer for all configuration operations:

- Creating & updating endpoints
- Uploading models / weights
- Scaling configs (min/max replicas)
- Monitoring setup
Data plane
The Dedicated Endpoints Data Plane processes model inference requests and has a regional base URL. Region impacts latency, data locality, and regulatory compliance. Use a region-appropriate base URL for your inference calls:

| Endpoint region | Inference base URL |
|---|---|
| eu-north1 | https://api.tokenfactory.nebius.com |
| eu-west1 | https://api.tokenfactory.eu-west1.nebius.com |
| us-central1 | https://api.tokenfactory.us-central1.nebius.com |
Using the respective inference base URL avoids unnecessary global routing and reduces round-trip latency.
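For example, a client can derive the inference base URL from the endpoint's region. This is a minimal sketch; the mapping simply mirrors the table above:

```python
# Inference base URLs per endpoint region (from the table above).
INFERENCE_BASE_URLS = {
    "eu-north1": "https://api.tokenfactory.nebius.com",
    "eu-west1": "https://api.tokenfactory.eu-west1.nebius.com",
    "us-central1": "https://api.tokenfactory.us-central1.nebius.com",
}

def inference_base_url(region: str) -> str:
    """Return the region-appropriate base URL to avoid global routing."""
    return INFERENCE_BASE_URLS[region]
```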
Launching & operating a dedicated endpoint
List available model templates
List templates that can back a dedicated endpoint.
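A minimal sketch with Python and requests. The control-plane base URL, the /v1/templates path, and the NEBIUS_API_KEY environment variable are assumptions for illustration; confirm the exact route in the API reference:

```python
import os

import requests

# Assumption: control-plane base URL and route are illustrative placeholders.
CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

resp = requests.get(f"{CONTROL_PLANE_URL}/v1/templates", headers=headers)
resp.raise_for_status()
# Each template lists the supported flavor_name, gpu_type, and regions.
print(resp.json())
```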
Create a dedicated endpoint

Capacity and scaling
- Deployment depends on available GPU capacity. Some instance types may be temporarily unavailable.
- Minimum replicas are reserved and always available while the endpoint is running.
- Capacity used for scaling above the minimum is opportunistic. It is not preempted during active use, but may be reclaimed once the endpoint scales down.
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Display name for the endpoint |
| description | string | no | Optional description |
| model_name | string | yes | Template model name (e.g. openai/gpt-oss-120b) |
| flavor_name | string | yes | Template flavor (e.g. base, fast) |
| gpu_type | string | yes | GPU type supported by the chosen template + flavor |
| gpu_count | integer | yes | GPUs per replica. Total maximum GPUs = gpu_count × scaling.max_replicas |
| region | string | yes | eu-north1, eu-west1, or us-central1 |
| scaling.min_replicas | integer | yes | Minimum replicas |
| scaling.max_replicas | integer | yes | Maximum replicas |
The create response returns an endpoint_id (used for update/delete operations) and a routing_key (the model identifier for inference calls).
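As an illustration, a create request could look like the sketch below. The /v1/endpoints path, the example values, and the response shape are assumptions; the field names come from the table above:

```python
import os

import requests

CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"  # assumption: confirm the control-plane base URL
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

payload = {
    "name": "my-dedicated-endpoint",
    "description": "Dedicated endpoint for production traffic",
    "model_name": "openai/gpt-oss-120b",
    "flavor_name": "base",
    "gpu_type": "<gpu_type>",  # must be supported by the chosen template + flavor
    "gpu_count": 1,            # GPUs per replica
    "region": "eu-north1",
    "scaling": {"min_replicas": 1, "max_replicas": 2},
}

resp = requests.post(f"{CONTROL_PLANE_URL}/v1/endpoints", headers=headers, json=payload)
resp.raise_for_status()
endpoint = resp.json()
# Keep endpoint_id for updates/deletes and routing_key for inference calls.
print(endpoint["endpoint_id"], endpoint["routing_key"])
```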
Initial bring-up can take several minutes. While provisioning, inference may fail (often with a 404) until the endpoint is routable.

Send inference requests
Once your endpoint is ready, send requests to the OpenAI-compatible data plane. Use the routing_key returned by the control plane as your model identifier:
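For example, with the OpenAI Python client (a sketch assuming a eu-north1 endpoint and a NEBIUS_API_KEY environment variable; substitute the routing_key returned at creation):

```python
import os

from openai import OpenAI

# Use the inference base URL that matches the endpoint's region (see the table above).
client = OpenAI(
    base_url="https://api.tokenfactory.nebius.com/v1",
    api_key=os.environ["NEBIUS_API_KEY"],
)

completion = client.chat.completions.create(
    model="<routing_key>",  # the routing_key returned by the control plane
    messages=[{"role": "user", "content": "Hello from my dedicated endpoint!"}],
)
print(completion.choices[0].message.content)
```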
We expose OpenAI-compatible inference routes under /v1. Template availability determines what kinds of models you can deploy (today, publicly available templates are primarily chat-capable).

Update Endpoint Configuration
PATCH updates only the provided fields.
| Field | Type | Description |
|---|---|---|
| name | string | Display name |
| description | string | Optional description |
| enabled | boolean | Controls whether the dedicated endpoint is active. Only enabled endpoints are charged. |
| gpu_type | string | Type of GPU |
| gpu_count | integer | Number of GPUs per replica |
| scaling.min_replicas | integer | Minimum replicas |
| scaling.max_replicas | integer | Maximum replicas |
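A sketch of a partial update, again assuming an illustrative /v1/endpoints/{endpoint_id} route; only the fields present in the body change:

```python
import os

import requests

CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"  # assumption
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}
endpoint_id = "<endpoint_id>"  # returned when the endpoint was created

# PATCH semantics: fields omitted from the body keep their current values.
resp = requests.patch(
    f"{CONTROL_PLANE_URL}/v1/endpoints/{endpoint_id}",
    headers=headers,
    json={"scaling": {"min_replicas": 1, "max_replicas": 4}},
)
resp.raise_for_status()
print(resp.json())
```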
Endpoint’s region config

region cannot be updated after the dedicated endpoint is created.

List dedicated endpoints
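A sketch, under the same assumed route and placeholder credentials as above:

```python
import os

import requests

CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"  # assumption
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

resp = requests.get(f"{CONTROL_PLANE_URL}/v1/endpoints", headers=headers)
resp.raise_for_status()
print(resp.json())  # each entry should include its endpoint_id and routing_key
```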
Delete endpoint
Deletes an endpoint permanently.
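A sketch of a delete call (route and placeholders assumed, as above); deletion is irreversible:

```python
import os

import requests

CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"  # assumption
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}
endpoint_id = "<endpoint_id>"

# Permanent: the endpoint cannot be recovered after deletion.
resp = requests.delete(f"{CONTROL_PLANE_URL}/v1/endpoints/{endpoint_id}", headers=headers)
resp.raise_for_status()
```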
Observability

See the Observability documentation section.
Troubleshooting
| Status | Typical cause | Fix |
|---|---|---|
| 401 Unauthorized | Missing/invalid token | Ensure Authorization: Bearer ... is set and the token is active |
| 403 Forbidden | Token lacks permission | Use a token with dedicated endpoint permissions |
| 404 Not Found | Endpoint not ready, wrong inference domain, or wrong routing_key | Wait for readiness; use the correct inference base URL for the endpoint region; pass routing_key exactly as returned |
| 409 Conflict | Capacity or config conflict | Choose a supported GPU/region combo from templates or reduce max replicas |
| 422 Unprocessable Entity | Invalid payload values | Validate fields against templates (GPU types, flavors, regions, counts) |
Billing and Accessibility Policy for Dedicated Endpoints
- A dedicated endpoint is considered active, accessible, and billable when at least one replica is running: the endpoint can serve traffic and billing charges apply
- When zero replicas are running, the endpoint is not accessible and billing charges do not apply
- Billing is driven by the GPU capacity implied by your scaling settings. This may vary depending on your custom contract or work order.