
Overview

Dedicated endpoints let you run an isolated deployment of a supported model template (for example openai/gpt-oss-20b or deepseek-ai/DeepSeek-V3.1) with control over:
  • Region: controls data residency and latency based on where the deployment runs
  • GPU configuration: defines the GPU type and the number of GPUs allocated per replica
  • Autoscaling: sets minimum and maximum replicas to handle traffic changes automatically
  • Lifecycle management: supports creating, updating, and deleting deployments

Once deployed, you can run inference on your endpoint using our OpenAI-compatible inference API.

Key Concepts

Term          Description
Template      A deployable “blueprint” for a model. Templates define which flavor_name, gpu_type, and regions are supported.
Flavor        A template variant (e.g. base, fast) with different performance/cost characteristics.
Endpoint      Your dedicated deployment created from a template.
endpoint_id   Identifier used for update/delete operations.
routing_key   The model identifier you pass to inference calls. Returned when you create an endpoint.

Authentication

To authenticate, include your API key (e.g., ABC123...) in the Authorization header, as shown below:
Authorization: Bearer ABC123...
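For example, with Python's requests library, every call in this guide builds the same header (a minimal sketch; read the key from the environment rather than hard-coding it):

import os

API_TOKEN = os.environ["API_TOKEN"]  # keep the key out of source code
headers = {"Authorization": f"Bearer {API_TOKEN}"}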
Manage tokens in your Nebius Token Factory dashboard:
  • http://tokenfactory.nebius.com/
Keep your keys private; do not share them or expose them in client-side code. If a key is compromised, Nebius Token Factory can revoke it automatically.

Base URLs

Control plane base URL

Use this for all dedicated endpoint management operations:
https://api.tokenfactory.nebius.com

Regions

Region impacts latency, data locality, and regulatory compliance.

Available regions

A region must be specified explicitly when you create an endpoint.
  • eu-north1
  • eu-west1
  • us-central1

Data plane base URL (inference)

Use a region-appropriate base URL for /v1 inference calls:
Endpoint region   Inference base URL
eu-north1         https://api.tokenfactory.nebius.com
eu-west1          https://api.tokenfactory.eu-west1.nebius.com
us-central1       https://api.tokenfactory.us-central1.nebius.com
If your endpoint is deployed in us-central1 or eu-west1, using the respective inference base URL avoids unnecessary global routing and reduces round-trip latency.
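A small helper keeps this mapping in one place (a sketch based on the table above; the function name is our own):

# Map each endpoint region to its inference base URL (from the table above).
INFERENCE_BASE_URLS = {
    "eu-north1": "https://api.tokenfactory.nebius.com",
    "eu-west1": "https://api.tokenfactory.eu-west1.nebius.com",
    "us-central1": "https://api.tokenfactory.us-central1.nebius.com",
}

def inference_base_url(region: str) -> str:
    """Return the inference base URL for an endpoint region (append /v1 for API calls)."""
    try:
        return INFERENCE_BASE_URLS[region]
    except KeyError:
        raise ValueError(f"Unknown region: {region!r}") from None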

1) List Available Templates

List templates that can back a dedicated endpoint.

Request

GET /v0/dedicated_endpoints/templates

Example

import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]

r = requests.get(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/templates",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
Use the template response as the source of truth for valid combinations of model_name, flavor_name, gpu_type, and region.
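If you want to check a combination programmatically, you can scan the response before creating an endpoint. The field names below (model_name, flavor_name, gpu_type, regions) mirror the concepts above but are assumptions; verify them against the actual response shape in your account:

templates = r.json()

# Hypothetical shape: a list of template objects; adjust keys to the real schema.
for tpl in templates:
    if tpl.get("model_name") == "openai/gpt-oss-20b":
        print(tpl.get("flavor_name"), tpl.get("gpu_type"), tpl.get("regions"))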

2) Create a Dedicated Endpoint

Create a dedicated endpoint from a model template.

Request

POST /v0/dedicated_endpoints

Request body

Field                  Type     Required  Description
name                   string   yes       Display name for the endpoint
description            string   no        Optional description
model_name             string   yes       Template model name (e.g. openai/gpt-oss-120b)
flavor_name            string   yes       Template flavor (e.g. base, fast)
gpu_type               string   yes       GPU type supported by the chosen template + flavor
gpu_count              integer  yes       GPUs per replica
region                 string   yes       eu-north1, eu-west1, or us-central1
scaling.min_replicas   integer  yes       Minimum replicas
scaling.max_replicas   integer  yes       Maximum replicas
gpu_count is per replica, so total maximum GPU consumption = gpu_count × scaling.max_replicas. For example, gpu_count = 2 with max_replicas = 4 can use up to 8 GPUs.

Example

import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]

payload = {
    "name": "GPT-20B Endpoint",
    "description": "Dedicated GPT-20B for internal apps",
    "model_name": "openai/gpt-oss-20b",
    "flavor_name": "base",
    "gpu_type": "gpu-h100-sxm",
    "gpu_count": 1,
    "region": "eu-north1",
    "scaling": {"min_replicas": 1, "max_replicas": 2},
}

r = requests.post(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))

Response

The create response includes endpoint metadata and (critically):
  • endpoint_id
  • routing_key
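For example, capture both from the create response above (assuming they appear at the top level of the returned JSON):

data = r.json()

endpoint_id = data["endpoint_id"]    # used for update/delete calls
routing_key = data["routing_key"]    # passed as the model in inference calls
print(endpoint_id, routing_key)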

3) List Your Endpoints

Request

GET /v0/dedicated_endpoints

Example

import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]

r = requests.get(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))

4) Run Inference on Your Endpoint

Once the endpoint is ready, call the OpenAI-compatible data plane under /v1.

Which /v1 base URL should I use?

  • If your endpoint region is us-central1,
    use: https://api.tokenfactory.us-central1.nebius.com
  • If your endpoint region is eu-west1,
    use: https://api.tokenfactory.eu-west1.nebius.com
  • Otherwise, use: https://api.tokenfactory.nebius.com

Chat completions example

Use the routing_key returned by the control plane:
  • model = "<routing_key>"
We expose OpenAI-compatible inference routes under /v1. Template availability determines what kinds of models you can deploy (today, publicly available templates are primarily chat-capable).

Example

import os
from openai import OpenAI

API_TOKEN = os.environ["API_TOKEN"]

# Choose based on the endpoint region
INFERENCE_BASE_URL = "https://api.tokenfactory.us-central1.nebius.com/v1"

client = OpenAI(
    base_url=INFERENCE_BASE_URL,
    api_key=API_TOKEN,
)

routing_key = "<routing_key>"  # use exactly what the API returned
response = client.chat.completions.create(
    model=routing_key,
    messages=[{"role": "user", "content": "Explain the difference between LLM fine-tuning and RAG."}],
)

print(response.choices[0].message.content)

5) Update Endpoint Configuration

PATCH updates only the provided fields.

Request

PATCH /v0/dedicated_endpoints/{endpoint_id}

Updatable fields

Field                  Type     Description
name                   string   Display name
description            string   Optional description
enabled                boolean  Controls whether the dedicated endpoint is active. Only enabled endpoints are charged.
gpu_type               string   GPU type
gpu_count              integer  Number of GPUs per replica
scaling.min_replicas   integer  Minimum replicas
scaling.max_replicas   integer  Maximum replicas
region cannot be updated after creation.

Example

import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
endpoint_id = "<endpoint_id>"

payload = {
    "name": "GPT-20B Endpoint (updated)",
    "scaling": {"min_replicas": 2, "max_replicas": 4},
}

r = requests.patch(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
Scaling changes may trigger provisioning of additional replicas. Plan for a short warm-up period while new replicas come online.
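Because only enabled endpoints are charged, the same PATCH route can pause an endpoint. A minimal sketch, assuming enabled can be sent on its own (PATCH leaves all other fields unchanged):

import os, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
endpoint_id = "<endpoint_id>"

# Pause the endpoint without touching any other configuration.
r = requests.patch(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"enabled": False},
)
r.raise_for_status()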

6) Delete an Endpoint

Deletes an endpoint permanently.

Request

DELETE /v0/dedicated_endpoints/{endpoint_id}
This is permanent. The GPUs associated with min_replicas are released for other users.

Example

import os, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
endpoint_id = "<endpoint_id>"

r = requests.delete(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
print(r.status_code)
r.raise_for_status()

Operational notes

  • Provisioning time: initial bring-up can take several minutes. While provisioning, inference may fail (often 404) until the endpoint is routable; see the readiness-polling sketch after this list.
  • Billing: driven by GPU capacity implied by your scaling settings. (Confirm exact billing rules in your org’s billing docs.)
  • Inference APIs: OpenAI-compatible /v1 routes are available; which modalities/routes you can use depends on the templates you deploy.
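A minimal readiness check along these lines polls the chat route until it stops returning 404 (a sketch; the timeout, polling interval, and prompt are arbitrary):

import os, time
from openai import OpenAI, NotFoundError

API_TOKEN = os.environ["API_TOKEN"]
INFERENCE_BASE_URL = "https://api.tokenfactory.nebius.com/v1"  # match your endpoint region
routing_key = "<routing_key>"

client = OpenAI(base_url=INFERENCE_BASE_URL, api_key=API_TOKEN)

# Poll until the endpoint is routable; 404 means it is still provisioning.
deadline = time.time() + 15 * 60
while time.time() < deadline:
    try:
        client.chat.completions.create(
            model=routing_key,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        print("Endpoint is ready")
        break
    except NotFoundError:
        time.sleep(30)
else:
    raise TimeoutError("Endpoint did not become ready in time")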

Common errors & fixes

  • 401 Unauthorized: missing or invalid token. Fix: ensure Authorization: Bearer ... is set and the token is active.
  • 403 Forbidden: token lacks permission. Fix: use a token with dedicated endpoint permissions.
  • 404 Not Found: endpoint not ready, wrong inference domain, or wrong routing_key. Fix: wait for readiness, use the correct inference base URL for the endpoint region, and pass routing_key exactly as returned.
  • 409 Conflict: capacity or configuration conflict. Fix: choose a supported GPU/region combination from the templates, or reduce max replicas.
  • 422 Unprocessable Entity: invalid payload values. Fix: validate fields against templates (GPU types, flavors, regions, counts).
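If you want these cases surfaced programmatically, a small wrapper around control plane calls can translate status codes into the fixes above (a sketch, not an official client):

import requests

# Human-readable hints keyed by status code, taken from the list above.
HINTS = {
    401: "Check that Authorization: Bearer <token> is set and the token is active.",
    403: "Use a token with dedicated endpoint permissions.",
    404: "Endpoint may not be ready yet, or the base URL / routing_key is wrong.",
    409: "Pick a supported GPU/region combination from templates, or lower max replicas.",
    422: "Validate fields (gpu_type, flavor_name, region, counts) against templates.",
}

def check(response: requests.Response) -> requests.Response:
    """Raise with a readable hint when a control plane call fails."""
    if response.status_code in HINTS:
        raise RuntimeError(f"{response.status_code}: {HINTS[response.status_code]}")
    response.raise_for_status()
    return response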