Overview

Dedicated endpoints let you run an isolated deployment of a supported model with control over:

Region

Controls data residency and latency based on where the deployment runs

GPU Configuration

Defines GPU type and number of GPUs allocated per replica

Autoscaling

Set minimum and maximum replicas to handle traffic changes automatically

Lifecycle Management

Supports creating, updating, and deleting deployments

Key Concepts

Term | Description
Template | A deployable performance “blueprint” for a model. Templates define which flavor_name, gpu_type, and regions are supported.
Flavor | A template’s sub-option (e.g. base, fast) with different performance, throughput, and cost characteristics.
Endpoint | A dedicated deployment with API access.
endpoint_id | Identifier used for update and delete operations.
routing_key | The model identifier you pass to inference calls. Returned when you create an endpoint.
Control plane | Manages the configuration and settings of your endpoints. Served from a common base URL.
Data plane | Processes model inference requests. Served from regional base URLs.

Control plane

Dedicated Endpoints Control Plane is the management layer for all configuration operations:
  • Creating & updating endpoints
  • Uploading models / weights
  • Scaling configs (min/max replicas)
  • Monitoring setup
Use the common base URL for all dedicated endpoint management operations:
https://api.tokenfactory.nebius.com

Data plane

Dedicated Endpoints Data Plane processes model inference requests and is served from regional base URLs. Region impacts latency, data locality, and regulatory compliance. Use a region-appropriate base URL for your inference calls:
Endpoint region | Inference base URL
eu-north1 | https://api.tokenfactory.nebius.com
eu-west1 | https://api.tokenfactory.eu-west1.nebius.com
us-central1 | https://api.tokenfactory.us-central1.nebius.com
Using the respective inference base URL avoids unnecessary global routing and reduces round-trip latency.
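As a convenience, a small lookup (names are illustrative, not part of the API) can select the inference base URL from an endpoint's region:

# Minimal sketch: map an endpoint region to its inference base URL.
# Region names and URLs come from the table above; the helper name is illustrative.
INFERENCE_BASE_URLS = {
    "eu-north1": "https://api.tokenfactory.nebius.com",
    "eu-west1": "https://api.tokenfactory.eu-west1.nebius.com",
    "us-central1": "https://api.tokenfactory.us-central1.nebius.com",
}

def inference_base_url(region: str) -> str:
    """Return the region-appropriate inference base URL."""
    try:
        return INFERENCE_BASE_URLS[region]
    except KeyError:
        raise ValueError(f"Unknown endpoint region: {region}")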

Launching & operating a dedicated endpoint

1. List available model templates
2. Create a dedicated endpoint
3. Send inference requests

List available model templates

List templates that can back a dedicated endpoint.
GET /v0/dedicated_endpoints/templates
import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]

r = requests.get(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/templates",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
Use the template response as the source of truth for valid combinations of model_name, flavor_name, gpu_type, and region.
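For example, a client-side filter (a sketch that assumes the response is a list of template objects exposing the fields named above; adjust to the actual shape) can narrow templates to one model before creating an endpoint:

import os, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]

r = requests.get(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/templates",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
r.raise_for_status()

# Assumption: the body is a list of template objects with
# model_name, flavor_name, gpu_type, and region fields.
for t in r.json():
    if t.get("model_name") == "openai/gpt-oss-20b":
        print(t.get("flavor_name"), t.get("gpu_type"), t.get("region"))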

Create a dedicated endpoint

Capacity and scaling
  • Deployment depends on available GPU capacity. Some instance types may be temporarily unavailable.
  • Minimum replicas are reserved and always available while the endpoint is running.
  • Capacity used for scaling above the minimum is opportunistic. It is not preempted during active use, but may be reclaimed once the endpoint scales down.
If you encounter capacity errors or are unable to scale beyond the minimum number of replicas, contact our Sales team to reserve dedicated GPU capacity.
Create a dedicated endpoint from a model template.
POST /v0/dedicated_endpoints
Request body:
Field | Type | Required | Description
name | string | yes | Display name for the endpoint
description | string | no | Optional description
model_name | string | yes | Template model name (e.g. openai/gpt-oss-120b)
flavor_name | string | yes | Template flavor (e.g. base, fast)
gpu_type | string | yes | GPU type supported by the chosen template + flavor
gpu_count | integer | yes | GPUs per replica. Total maximum GPUs = gpu_count × scaling.max_replicas
region | string | yes | One of eu-north1, eu-west1, us-central1
scaling.min_replicas | integer | yes | Minimum replicas
scaling.max_replicas | integer | yes | Maximum replicas
import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]

payload = {
    "name": "GPT-20B Endpoint",
    "description": "Dedicated GPT-20B for internal apps",
    "model_name": "openai/gpt-oss-20b",
    "flavor_name": "base",
    "gpu_type": "gpu-h100-sxm",
    "gpu_count": 1,
    "region": "eu-north1",
    "scaling": {"min_replicas": 1, "max_replicas": 2},
}

r = requests.post(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
The create response includes endpoint metadata:
  • endpoint_id
  • routing_key
Initial bring-up can take several minutes. While provisioning, inference may fail (often 404) until the endpoint is routable.
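Until then, a simple readiness loop (illustrative; no official wait API is shown here) can retry a small inference request and treat 404 as "not routable yet":

import os, time, requests

API_TOKEN = os.environ["API_TOKEN"]
# eu-north1 endpoints are served from the common domain (see the Data plane table).
INFERENCE_BASE_URL = "https://api.tokenfactory.nebius.com"
routing_key = "<routing_key>"  # from the create response

payload = {"model": routing_key, "messages": [{"role": "user", "content": "ping"}]}

# Sketch: retry until the endpoint stops returning 404; timings are illustrative.
for attempt in range(30):
    r = requests.post(
        f"{INFERENCE_BASE_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json=payload,
    )
    if r.status_code != 404:
        r.raise_for_status()
        print("Endpoint is ready")
        break
    time.sleep(20)
else:
    raise TimeoutError("Endpoint did not become routable in time")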

Send inference requests

Once your endpoint is ready, send requests to the OpenAI-compatible data plane. Use the routing_key returned by the control plane as your model identifier:
model = "<routing_key>"
Use a region-appropriate base URL for your inference calls. Example request URL:
https://api.tokenfactory.eu-west1.nebius.com/v1/chat/completions
import os
from openai import OpenAI

API_TOKEN = os.environ["API_TOKEN"]

# Choose based on the endpoint region
INFERENCE_BASE_URL = "https://api.tokenfactory.us-central1.nebius.com/v1"

client = OpenAI(
    base_url=INFERENCE_BASE_URL,
    api_key=API_TOKEN,
)

routing_key = "<routing_key>"  # use exactly what the API returned
response = client.chat.completions.create(
    model=routing_key,
    messages=[{"role": "user", "content": "Explain the difference between LLM fine-tuning and RAG."}],
)

print(response.choices[0].message.content)
We expose OpenAI-compatible inference routes under /v1. Template availability determines what kinds of models you can deploy (today, publicly available templates are primarily chat-capable).
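Because the routes are OpenAI-compatible, standard SDK features should work as usual. For example, a streaming request (assuming the deployed model supports the standard OpenAI streaming protocol):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.us-central1.nebius.com/v1",
    api_key=os.environ["API_TOKEN"],
)

# Sketch: print tokens as they are generated.
stream = client.chat.completions.create(
    model="<routing_key>",
    messages=[{"role": "user", "content": "Summarize RAG in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()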

Update Endpoint Configuration

PATCH updates only the provided fields.
PATCH /v0/dedicated_endpoints/{endpoint_id}
Updatable fields:
Field | Type | Description
name | string | Display name
description | string | Optional description
enabled | boolean | Controls whether the dedicated endpoint is active. Only enabled endpoints are charged.
gpu_type | string | Type of GPU
gpu_count | integer | Number of GPUs per replica
scaling.min_replicas | integer | Minimum replicas
scaling.max_replicas | integer | Maximum replicas
An endpoint’s region cannot be changed after the endpoint is created.
import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
endpoint_id = "<endpoint_id>"

payload = {
    "name": "GPT-20B Endpoint (updated)",
    "scaling": {"min_replicas": 2, "max_replicas": 4},
}

r = requests.patch(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
Scaling changes may trigger provisioning of additional replicas. Plan for a short warm-up period where new replicas come online.
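Because only enabled endpoints are charged, flipping the enabled flag is one way to pause an endpoint without deleting it (a minimal sketch using the same PATCH route):

import os, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
endpoint_id = "<endpoint_id>"

# Disable the endpoint; PATCH {"enabled": true} later to resume serving.
r = requests.patch(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"enabled": False},
)
r.raise_for_status()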

List dedicated endpoints

GET /v0/dedicated_endpoints
import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]

r = requests.get(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))

Delete endpoint

Deletes an endpoint permanently.
DELETE /v0/dedicated_endpoints/{endpoint_id}
This is a permanent action. The GPUs associated with min_replicas are released for other users.
import os, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
endpoint_id = "<endpoint_id>"

r = requests.delete(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
print(r.status_code)
r.raise_for_status()

Endpoint Observability

See the Observability documentation section.

Troubleshooting

Status | Typical cause | Fix
401 Unauthorized | Missing/invalid token | Ensure Authorization: Bearer ... is set and the token is active
403 Forbidden | Token lacks permission | Use a token with dedicated endpoint permissions
404 Not Found | Endpoint not ready, wrong inference domain, or wrong routing_key | Wait for readiness; use the correct inference base URL for the endpoint region; pass routing_key exactly as returned
409 Conflict | Capacity or config conflict | Choose a supported GPU/region combo from templates or reduce max replicas
422 Unprocessable Entity | Invalid payload values | Validate fields against templates (GPU types, flavors, regions, counts)
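As a sketch, these cases can be surfaced explicitly in client code instead of relying on raise_for_status alone (the helper name and messages are illustrative, mirroring the table above):

import requests

HINTS = {
    401: "Check that Authorization: Bearer <token> is set and the token is active.",
    403: "Use a token with dedicated endpoint permissions.",
    404: "Endpoint may not be ready, or the base URL / routing_key is wrong.",
    409: "Capacity or config conflict; pick a supported GPU/region combo.",
    422: "Validate payload fields against the templates response.",
}

def check(response: requests.Response) -> requests.Response:
    """Raise with a troubleshooting hint for known error statuses."""
    if response.status_code in HINTS:
        raise RuntimeError(f"{response.status_code}: {HINTS[response.status_code]}")
    response.raise_for_status()
    return response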

Billing and Accessibility Policy for Dedicated Endpoints

  • A dedicated endpoint is considered active, accessible, and billable when at least one replica is running; it is then available to serve traffic and billing charges apply
  • When zero replicas are running, the endpoint is not accessible and billing charges do not apply
  • Billing is driven by the GPU capacity implied by your scaling settings. This may vary depending on your custom contract or work order.
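For example, the create request shown earlier implies at most gpu_count × max_replicas GPUs (a back-of-the-envelope capacity check, not a pricing calculator):

# Capacity implied by the earlier create example.
gpu_count = 1        # GPUs per replica
min_replicas = 1     # reserved while the endpoint is running
max_replicas = 2     # upper autoscaling bound

reserved_gpus = gpu_count * min_replicas  # always held: 1 GPU
peak_gpus = gpu_count * max_replicas      # under load: up to 2 GPUs
print(f"Reserved GPUs: {reserved_gpus}, peak GPUs: {peak_gpus}")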