Overview
Dedicated endpoints let you run an isolated deployment of a supported model template (for example openai/gpt-oss-20b, deepseek-ai/DeepSeek-V3.1) with control over:
- Region: controls data residency and latency based on where the deployment runs
- GPU configuration: defines the GPU type and number of GPUs allocated per replica
- Autoscaling: set minimum and maximum replicas to handle traffic changes automatically
- Lifecycle management: supports creating, updating, and deleting deployments
Key Concepts
| Term | Description |
|---|---|
| Template | A deployable “blueprint” for a model. Templates define which flavor_name, gpu_type, and regions are supported. |
| Flavor | A template variant (e.g. base, fast) with different performance/cost characteristics. |
| Endpoint | Your dedicated deployment created from a template. |
| endpoint_id | Identifier used for update/delete operations. |
| routing_key | The model identifier you pass to inference calls. Returned when you create an endpoint. |
Authentication
To authenticate, include your API key (e.g., ABC123...) in the Authorization header, as shown below:
http://tokenfactory.nebius.com/
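For instance, a minimal sketch in Python using the requests library; the control-plane base URL placeholder and the management path shown here are illustrative, not confirmed routes:

```python
import os
import requests

# Substitute the control plane base URL from the "Base URLs" section below.
CONTROL_PLANE = "https://<control-plane-base-url>"

# Every dedicated-endpoint management call carries the API key as a Bearer token.
headers = {
    "Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}",  # e.g. ABC123...
    "Content-Type": "application/json",
}

# Illustrative call only; use the route from the relevant Request section.
resp = requests.get(f"{CONTROL_PLANE}/v1/dedicated-endpoints", headers=headers)
resp.raise_for_status()
```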
Base URLs
Control plane base URL
Use this for all dedicated endpoint management operations.
Regions
Region impacts latency, data locality, and regulatory compliance.
Available regions
Regions need to be specified explicitly:
- eu-north1
- eu-west1
- us-central1
Data plane base URL (inference)
Use a region-appropriate base URL for /v1 inference calls:
| Endpoint region | Inference base URL |
|---|---|
| eu-north1 | https://api.tokenfactory.nebius.com |
| eu-west1 | https://api.tokenfactory.eu-west1.nebius.com |
| us-central1 | https://api.tokenfactory.us-central1.nebius.com |
If your endpoint is deployed in us-central1 or eu-west1, using the respective inference base URL avoids unnecessary global routing and reduces round-trip latency.
1) List Available Templates
List templates that can back a dedicated endpoint.
Request
Example
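A sketch, assuming a GET against a templates collection on the control plane; the /v1/templates path and the response field names are assumptions, so substitute the actual route from the Request section above:

```python
import os
import requests

CONTROL_PLANE = "https://<control-plane-base-url>"  # control plane base URL
HEADERS = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

# Hypothetical path; use the route from the Request section above.
resp = requests.get(f"{CONTROL_PLANE}/v1/templates", headers=HEADERS)
resp.raise_for_status()

# Field names below are illustrative; inspect the raw JSON for the real shape.
for template in resp.json().get("items", []):
    print(template.get("model_name"), template.get("flavor_name"),
          template.get("gpu_types"), template.get("regions"))
```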
2) Create a Dedicated Endpoint
Create a dedicated endpoint from a model template.
Request
Request body
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Display name for the endpoint |
| description | string | no | Optional description |
| model_name | string | yes | Template model name (e.g. openai/gpt-oss-120b) |
| flavor_name | string | yes | Template flavor (e.g. base, fast) |
| gpu_type | string | yes | GPU type supported by the chosen template + flavor |
| gpu_count | integer | yes | GPUs per replica |
| region | string | yes | eu-north1, eu-west1, us-central1 |
| scaling.min_replicas | integer | yes | Minimum replicas |
| scaling.max_replicas | integer | yes | Maximum replicas |
gpu_count is per replica. Total max GPUs = gpu_count × scaling.max_replicas.
Example
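A sketch of a create call assembled from the request-body table above; the /v1/dedicated-endpoints path, the GPU type value, and the exact response field names are assumptions:

```python
import os
import requests

CONTROL_PLANE = "https://<control-plane-base-url>"
HEADERS = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

body = {
    "name": "my-gpt-oss-endpoint",
    "description": "Dedicated deployment for internal workloads",
    "model_name": "openai/gpt-oss-120b",
    "flavor_name": "base",
    "gpu_type": "<gpu-type-from-template>",   # must be supported by the template + flavor
    "gpu_count": 2,                           # GPUs per replica
    "region": "eu-north1",
    "scaling": {"min_replicas": 1, "max_replicas": 4},  # total max GPUs = 2 x 4 = 8
}

# Hypothetical path; use the route from the Request section above.
resp = requests.post(f"{CONTROL_PLANE}/v1/dedicated-endpoints", headers=HEADERS, json=body)
resp.raise_for_status()

created = resp.json()
endpoint_id = created["endpoint_id"]   # keep for update/delete calls
routing_key = created["routing_key"]   # pass as `model` in inference calls
print(endpoint_id, routing_key)
```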
Response
The create response includes endpoint metadata and, critically:
- endpoint_id
- routing_key
3) List Your Endpoints
Request
Example
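A sketch along the same lines; the path and response field names are assumptions:

```python
import os
import requests

CONTROL_PLANE = "https://<control-plane-base-url>"
HEADERS = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

# Hypothetical path; use the route from the Request section above.
resp = requests.get(f"{CONTROL_PLANE}/v1/dedicated-endpoints", headers=HEADERS)
resp.raise_for_status()

# Field names are illustrative; inspect the raw JSON for the real shape.
for endpoint in resp.json().get("items", []):
    print(endpoint.get("endpoint_id"), endpoint.get("name"),
          endpoint.get("region"), endpoint.get("routing_key"))
```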
4) Run Inference on Your Endpoint
Once the endpoint is ready, call the OpenAI-compatible data plane under /v1.
Which /v1 base URL should I use?
- If your endpoint region is us-central1, use: https://api.tokenfactory.us-central1.nebius.com
- If your endpoint region is eu-west1, use: https://api.tokenfactory.eu-west1.nebius.com
- Otherwise, use: https://api.tokenfactory.nebius.com
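If you select the base URL programmatically, a small mapping over the list above is enough; the fallback covers eu-north1 and any other region served by the global domain:

```python
INFERENCE_BASE_URLS = {
    "us-central1": "https://api.tokenfactory.us-central1.nebius.com",
    "eu-west1": "https://api.tokenfactory.eu-west1.nebius.com",
}

def inference_base_url(region: str) -> str:
    # eu-north1 (and anything unlisted) uses the global domain.
    return INFERENCE_BASE_URLS.get(region, "https://api.tokenfactory.nebius.com")
```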
Chat completions example
Use the routing_key returned by the control plane:
model = "<routing_key>"
We expose OpenAI-compatible inference routes under /v1. Template availability determines what kinds of models you can deploy (today, publicly available templates are primarily chat-capable).
Example
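A sketch using the openai Python client against the data plane; the base URL shown assumes an eu-north1 endpoint, and the model value is the routing_key placeholder from above:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.nebius.com/v1",  # pick the region-appropriate base URL
    api_key=os.environ["NEBIUS_API_KEY"],
)

completion = client.chat.completions.create(
    model="<routing_key>",  # the routing_key returned when the endpoint was created
    messages=[{"role": "user", "content": "Say hello from my dedicated endpoint."}],
)
print(completion.choices[0].message.content)
```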
5) Update Endpoint Configuration
PATCH updates only the provided fields.
Request
Updatable fields
| Field | Type | Description |
|---|---|---|
| name | string | Display name |
| description | string | Optional description |
| enabled | boolean | Controls whether the dedicated endpoint is active. Only enabled endpoints are charged. |
| gpu_type | string | Type of GPU |
| gpu_count | integer | Number of GPUs per replica |
| scaling.min_replicas | integer | Minimum replicas |
| scaling.max_replicas | integer | Maximum replicas |
region cannot be updated after creation.
Example
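A sketch of a PATCH that adjusts scaling on an existing endpoint; the path is an assumption, and only the fields being changed are sent:

```python
import os
import requests

CONTROL_PLANE = "https://<control-plane-base-url>"
HEADERS = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

endpoint_id = "<endpoint_id>"  # returned by the create call

# PATCH semantics: only the provided fields are updated.
patch = {
    "enabled": True,
    "scaling": {"min_replicas": 1, "max_replicas": 2},
}

# Hypothetical path; use the route from the Request section above.
resp = requests.patch(f"{CONTROL_PLANE}/v1/dedicated-endpoints/{endpoint_id}",
                      headers=HEADERS, json=patch)
resp.raise_for_status()
```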
6) Delete an Endpoint
Deletes an endpoint permanently.
Request
This is permanent. The GPUs associated with min_replicas are released for other users.
Example
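A sketch with an assumed path; since deletion is permanent, double-check the endpoint_id first:

```python
import os
import requests

CONTROL_PLANE = "https://<control-plane-base-url>"
HEADERS = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

endpoint_id = "<endpoint_id>"

# Hypothetical path; deletion cannot be undone.
resp = requests.delete(f"{CONTROL_PLANE}/v1/dedicated-endpoints/{endpoint_id}", headers=HEADERS)
resp.raise_for_status()
```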
Operational notes
- Provisioning time: initial bring-up can take several minutes. While provisioning, inference may fail (often with 404) until the endpoint is routable.
- Billing: driven by the GPU capacity implied by your scaling settings. (Confirm exact billing rules in your org’s billing docs.)
- Inference APIs: OpenAI-compatible /v1 routes are available; which modalities/routes you can use depends on the templates you deploy.
Common errors & fixes
| Status | Typical cause | Fix |
|---|---|---|
| 401 Unauthorized | Missing/invalid token | Ensure Authorization: Bearer ... is set and the token is active |
| 403 Forbidden | Token lacks permission | Use a token with dedicated endpoint permissions |
| 404 Not Found | Endpoint not ready, wrong inference domain, or wrong routing_key | Wait for readiness; use the correct inference base URL for the endpoint region; pass routing_key exactly as returned |
| 409 Conflict | Capacity or config conflict | Choose a supported GPU/region combo from templates or reduce max replicas |
| 422 Unprocessable Entity | Invalid payload values | Validate fields against templates (GPU types, flavors, regions, counts) |