Overview
Dedicated endpoints let you:
- Launch an isolated instance of a model template (e.g., openai/gpt-oss-120b, deepseek-ai/DeepSeek-V3.1)
- Define GPU type, region, and scaling behavior
- Manage lifecycle (create, update, delete)
- Access the endpoint via standard OpenAI-compatible APIs (/v1/chat/completions)
Each endpoint is tied to a template that defines which configurations it supports.
Authentication
All requests require a Bearer token:
Authorization: Bearer <YOUR_API_TOKEN>
Manage tokens from your Nebius Token Factory dashboard.
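For the Python examples in this guide, here is a minimal sketch of building the authorization header, assuming the token is stored in an API_TOKEN environment variable:

import os

# Read the token from the environment rather than hard-coding it.
API_TOKEN = os.environ.get("API_TOKEN")

# Every request in this guide carries this header.
headers = {"Authorization": f"Bearer {API_TOKEN}"}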
Base URL
https://api.tokenfactory.nebius.com
Regions
Dedicated endpoints can be deployed in multiple geographic regions. Region choice impacts latency, data locality, and regulatory compliance.
Available regions
eu-north1
eu-west1
us-central1
For workloads served to US users, deploying to us-central1 keeps traffic within the region, avoiding unnecessary global routing and substantially lowering round-trip latency.
Default
If you do not specify a region, endpoints default to eu-north1.
1. List Available Templates
List all available model templates that can back a dedicated endpoint.
Request
GET /v0/dedicated_endpoints/templates
import requests, os, json

BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ.get("API_TOKEN")

r = requests.get(
    f"{BASE_URL}/v0/dedicated_endpoints/templates",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
print(json.dumps(r.json(), indent=2))
curl -s -H "Authorization: Bearer $API_TOKEN" \
  https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/templates
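To choose a valid flavor_name and gpu_type for the create call in the next section, inspect the template list. This is a hypothetical sketch; the field names used here (model_name, flavor_name, gpu_types) are assumptions about the response shape, so adjust them to what the API actually returns:

# Hypothetical response fields -- verify against the actual payload.
templates = r.json()
for t in templates:
    print(t.get("model_name"), t.get("flavor_name"), t.get("gpu_types"))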
2. Create a Dedicated Endpoint
Spin up a dedicated endpoint from a model template.
The model must have a valid template (see previous section).
Request
POST /v0/dedicated_endpoints
Request body fields
| Field | Type | Description |
|---|---|---|
| name | string | Display name for the endpoint |
| description | string | Optional description |
| model_name | string | Template model name (e.g., openai/gpt-oss-120b) |
| flavor_name | string | Template flavor (e.g., base, fast) |
| gpu_type | string | GPU type supported by the template |
| gpu_count | integer | Number of GPUs per replica |
| region | string | Deployment region: eu-north1, eu-west1, or us-central1. Defaults to eu-north1. |
| scaling.min_replicas | integer | Minimum number of replicas |
| scaling.max_replicas | integer | Maximum number of replicas |
Example
import requests, os, json

BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ.get("API_TOKEN")

payload = {
    "name": "GPT-120B Endpoint",
    "description": "Dedicated GPT-120B for internal apps",
    "model_name": "openai/gpt-oss-120b",
    "flavor_name": "base",
    "gpu_type": "gpu-h100-sxm",
    "gpu_count": 1,
    "region": "us-central1",
    "scaling": {"min_replicas": 1, "max_replicas": 2}
}
headers = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Content-Type": "application/json",
}

r = requests.post(f"{BASE_URL}/v0/dedicated_endpoints", headers=headers, json=payload)
print(json.dumps(r.json(), indent=2))
curl -X POST https://api.tokenfactory.nebius.com/v0/dedicated_endpoints \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "GPT-120B Endpoint",
    "description": "Dedicated GPT-120B for internal apps",
    "model_name": "openai/gpt-oss-120b",
    "flavor_name": "base",
    "gpu_type": "gpu-h100-sxm",
    "gpu_count": 1,
    "region": "us-central1",
    "scaling": {"min_replicas": 1, "max_replicas": 2}
  }'
Response
The response includes endpoint metadata and the routing_key, which you append to the model name during inference (see section 4).
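As a minimal sketch of capturing these values for later calls, continuing from the Python example above. The routing_key field is documented here; the id field for the endpoint ID is an assumption, so check the actual response:

endpoint = r.json()

# routing_key is documented above; "id" is an assumed field name.
routing_key = endpoint["routing_key"]
endpoint_id = endpoint.get("id")

# Dedicated models are addressed as dedicated/<model_name>-<routing_key>.
model_id = f"dedicated/{payload['model_name']}-{routing_key}"
print(model_id, endpoint_id)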
3. List Your Endpoints
Request
GET /v0/dedicated_endpoints
# Reuses BASE_URL and headers from the create example above.
r = requests.get(f"{BASE_URL}/v0/dedicated_endpoints", headers=headers)
print(json.dumps(r.json(), indent=2))
curl -s -H "Authorization: Bearer $API_TOKEN" \
  https://api.tokenfactory.nebius.com/v0/dedicated_endpoints
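If you did not capture the endpoint ID at creation time, you can look it up in the list response before updating or deleting. A hypothetical sketch, assuming the response is a JSON array of objects with name and id fields; adjust to the actual response shape:

# Hypothetical: "name" and "id" are assumed field names in the list response.
endpoints = r.json()
endpoint_id = next(e["id"] for e in endpoints if e["name"] == "GPT-120B Endpoint")
print(endpoint_id)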
4. Run Inference on Your Endpoint
Use the standard OpenAI-compatible API once the endpoint is ready. Startup usually takes 10–15 minutes (see Notes); a simple readiness check is sketched after the examples below.
from openai import OpenAI
import os

API_TOKEN = os.environ.get("API_TOKEN")

client = OpenAI(
    base_url="https://api.tokenfactory.nebius.com/v1/",
    api_key=API_TOKEN,
)

# Replace <routing_key> with the routing_key returned by the create call.
response = client.chat.completions.create(
    model="dedicated/openai/gpt-oss-120b-<routing_key>",
    messages=[{"role": "user", "content": "Explain the difference between LLM fine-tuning and RAG."}],
)
print(response.choices[0].message.content)
curl https://api.tokenfactory.nebius.com/v1/chat/completions \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dedicated/openai/gpt-oss-120b-<routing_key>",
    "messages": [{"role": "user", "content": "Hello from my dedicated endpoint!"}]
  }'
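Because inference can return 404 Not Found while the endpoint is still starting up (see Notes), a simple readiness check is to retry the first request. A minimal sketch reusing the client from the Python example above, assuming the OpenAI SDK raises openai.NotFoundError for a 404:

import time
import openai

# Substitute your endpoint's routing_key.
MODEL = "dedicated/openai/gpt-oss-120b-<routing_key>"

for attempt in range(30):
    try:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": "ping"}],
        )
        print(response.choices[0].message.content)
        break
    except openai.NotFoundError:
        # Not routable yet; startup can take 10-15 minutes.
        time.sleep(60)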
5. Update Endpoint Configuration
PATCH updates only the provided fields.
PATCH /v0/dedicated_endpoints/{endpoint_id}
# endpoint_id comes from the create or list response (see sections 2-3).
endpoint_id = "<ENDPOINT_ID>"

payload = {
    "name": "GPT-120B Endpoint (updated)",
    "region": "eu-west1",
    "scaling": {"min_replicas": 2, "max_replicas": 4}
}
r = requests.patch(
    f"{BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers=headers,
    json=payload,
)
print(json.dumps(r.json(), indent=2))
curl -X PATCH https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/<ENDPOINT_ID> \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"region": "eu-west1", "scaling": {"min_replicas": 2, "max_replicas": 4}}'
6. Delete an Endpoint
DELETE /v0/dedicated_endpoints/{endpoint_id}
This is permanent. Reserved GPUs are released immediately.
# Deletion is permanent; endpoint_id as above.
r = requests.delete(f"{BASE_URL}/v0/dedicated_endpoints/{endpoint_id}", headers=headers)
print(r.status_code)
curl -X DELETE https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/<ENDPOINT_ID> \
  -H "Authorization: Bearer $API_TOKEN"
Notes
- Endpoint startup can take 10–15 minutes; during this time inference may return 404 Not Found.
- Billing is tied to active GPU usage based on your scaling settings; startup and shutdown time is not billed.
- Once live, the endpoint behaves like any OpenAI-style model under /v1/chat/completions.