Overview

Dedicated endpoints let you:
  • Launch an isolated instance of a model template (e.g., openai/gpt-oss-120b, deepseek-ai/DeepSeek-V3.1)
  • Define GPU type, region, and scaling behavior
  • Manage lifecycle (create, update, delete)
  • Access the endpoint via standard OpenAI-compatible APIs (/v1/chat/completions)
Each endpoint is tied to a template that defines which configurations it supports.

Authentication

All requests require a Bearer token:
Authorization: Bearer <YOUR_API_TOKEN>
Manage tokens from your Nebius Token Factory dashboard.
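
For example, you can attach the token to every request with a shared requests session in Python. This is a minimal sketch, assuming the token is stored in an API_TOKEN environment variable (as in the examples below):

Python

import os
import requests

# Reuse one session so the Authorization header travels with every call.
session = requests.Session()
session.headers["Authorization"] = f"Bearer {os.environ['API_TOKEN']}"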

Base URL

https://api.tokenfactory.nebius.com

Regions

Dedicated endpoints can be deployed in multiple geographic regions. Region choice impacts latency, data locality, and regulatory compliance.

Available regions

  • eu-north1
  • eu-west1
  • us-central1
For US-hosted endpoints, using the US regional domain avoids unnecessary global routing and reduces round-trip latency.

Default

If you do not specify a region, the endpoint is deployed in the default region, eu-north1.

1. List Available Templates

List all available model templates that can back a dedicated endpoint.

Request

GET /v0/dedicated_endpoints/templates

Python

import requests, os, json

BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ.get("API_TOKEN")

r = requests.get(f"{BASE_URL}/v0/dedicated_endpoints/templates",
                 headers={"Authorization": f"Bearer {API_TOKEN}"})
print(json.dumps(r.json(), indent=2))

cURL

curl -s -H "Authorization: Bearer $API_TOKEN" \
  https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/templates

2. Create a Dedicated Endpoint

Spin up a dedicated endpoint from a model template.
The model must have a valid template (see previous section).

Request

POST /v0/dedicated_endpoints

Request body fields

Field                 Type     Description
name                  string   Display name for the endpoint
description           string   Optional description
model_name            string   Template model name (e.g., openai/gpt-oss-120b)
flavor_name           string   Template flavor (e.g., base, fast)
gpu_type              string   GPU type supported by the template
gpu_count             integer  Number of GPUs per replica
region                string   Deployment region: eu-north1, eu-west1, or us-central1. Defaults to eu-north1.
scaling.min_replicas  integer  Minimum replicas
scaling.max_replicas  integer  Maximum replicas

Example

Python

import requests, os, json

BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ.get("API_TOKEN")

payload = {
    "name": "GPT-120B Endpoint",
    "description": "Dedicated GPT-120B for internal apps",
    "model_name": "openai/gpt-oss-120b",
    "flavor_name": "base",
    "gpu_type": "gpu-h100-sxm",
    "gpu_count": 1,
    "region": "us-central1",
    "scaling": {"min_replicas": 1, "max_replicas": 2}
}

headers = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Content-Type": "application/json"
}

r = requests.post(f"{BASE_URL}/v0/dedicated_endpoints", headers=headers, json=payload)
print(json.dumps(r.json(), indent=2))

cURL

curl -X POST https://api.tokenfactory.nebius.com/v0/dedicated_endpoints \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "GPT-120B Endpoint",
        "description": "Dedicated GPT-120B for internal apps",
        "model_name": "openai/gpt-oss-120b",
        "flavor_name": "base",
        "gpu_type": "gpu-h100-sxm",
        "gpu_count": 1,
        "region": "us-central1",
        "scaling": {"min_replicas": 1, "max_replicas": 2}
      }'

Response

The response includes endpoint metadata and a routing_key, which you embed in the model name at inference time (see section 4).
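
For example, you can capture both identifiers from the create response. A sketch: routing_key is documented above, but "id" as the field holding the endpoint ID is an assumption here.

Python

# Continues the create example above.
data = r.json()
endpoint_id = data["id"]           # assumed field name for the endpoint ID
routing_key = data["routing_key"]  # documented; used to address the endpoint

# The inference model name embeds the routing key (see section 4).
model_id = f"dedicated/openai/gpt-oss-120b-{routing_key}"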

3. List Your Endpoints

Request

GET /v0/dedicated_endpoints

Python

# Reuses BASE_URL and headers from the create example above.
r = requests.get(f"{BASE_URL}/v0/dedicated_endpoints", headers=headers)
print(json.dumps(r.json(), indent=2))

cURL

curl -s -H "Authorization: Bearer $API_TOKEN" \
  https://api.tokenfactory.nebius.com/v0/dedicated_endpoints

4. Run Inference on Your Endpoint

Use the standard OpenAI-compatible API once the endpoint is ready; startup typically takes 10–15 minutes (see Notes). Replace <routing_key> with the value returned when you created the endpoint.

Python

from openai import OpenAI
import os

API_TOKEN = os.environ.get("API_TOKEN")
client = OpenAI(
    base_url="https://api.tokenfactory.nebius.com/v1/",
    api_key=API_TOKEN
)

response = client.chat.completions.create(
    model="dedicated/openai/gpt-oss-120b-<routing_key>",
    messages=[{"role": "user", "content": "Explain the difference between LLM fine-tuning and RAG."}]
)

print(response.choices[0].message.content)

cURL

curl https://api.tokenfactory.nebius.com/v1/chat/completions \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dedicated/openai/gpt-oss-120b-<routing_key>",
    "messages": [{"role": "user", "content": "Hello from my dedicated endpoint!"}]
  }'

5. Update Endpoint Configuration

PATCH updates only the provided fields.

PATCH /v0/dedicated_endpoints/{endpoint_id}

Python

# endpoint_id comes from the create response or the list call above.
payload = {
    "name": "GPT-120B Endpoint (updated)",
    "region": "eu-west1",
    "scaling": {"min_replicas": 2, "max_replicas": 4}
}

r = requests.patch(
    f"{BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers=headers,
    json=payload
)
print(json.dumps(r.json(), indent=2))

cURL

curl -X PATCH https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/<ENDPOINT_ID> \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"region": "eu-west1", "scaling": {"min_replicas": 2, "max_replicas": 4}}'

6. Delete an Endpoint

DELETE /v0/dedicated_endpoints/{endpoint_id}

This is permanent. Reserved GPUs are released immediately.

Python

r = requests.delete(f"{BASE_URL}/v0/dedicated_endpoints/{endpoint_id}", headers=headers)
print(r.status_code)

cURL

curl -X DELETE https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/<ENDPOINT_ID> \
  -H "Authorization: Bearer $API_TOKEN"

Notes

  • Endpoint startup can take 10–15 minutes; during this time inference may return 404 Not Found (see the readiness sketch below).
  • Billing is tied to active GPU usage based on your scaling settings. Startup/shutdown are not billed.
  • Once live, the endpoint behaves like any OpenAI-style model under /v1/chat/completions.
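
Since a new endpoint returns 404 Not Found until startup completes, a simple readiness loop can wait for it before sending real traffic. This is a sketch only: the one-token probe, interval, and timeout are arbitrary choices, not part of the API.

Python

import time
from openai import OpenAI, NotFoundError

def wait_until_ready(client: OpenAI, model_id: str,
                     timeout_s: int = 1200, interval_s: int = 30) -> bool:
    """Poll with a one-token completion until the endpoint stops returning 404."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            client.chat.completions.create(
                model=model_id,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            )
            return True
        except NotFoundError:
            time.sleep(interval_s)  # still starting up; try again later
    return False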