Overview

Dedicated endpoints let you:
  • Launch an isolated instance of a model template (e.g. openai/gpt-oss-120b or deepseek-ai/DeepSeek-V3.1)
  • Define scaling and GPU configuration
  • Manage lifecycle: create, update, and delete endpoints
  • Access them via standard OpenAI-compatible API (/v1/chat/completions)
Each endpoint is tied to a template that defines its supported configurations.

Authentication

All endpoints require a Bearer API token.
Authorization: Bearer <YOUR_API_TOKEN>
You can generate or manage tokens from your Nebius Token Factory account dashboard.

Base URL

https://api.tokenfactory.nebius.com

1. List Available Templates

List all available model templates you can base your endpoint on.

Request

GET /v0/dedicated_endpoints/templates
Python:
import requests, os, json

BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ.get("API_TOKEN")

headers = {"Authorization": f"Bearer {API_TOKEN}"}
url = f"{BASE_URL}/v0/dedicated_endpoints/templates"

r = requests.get(url, headers=headers)
print(json.dumps(r.json(), indent=2))

2. Create a Dedicated Endpoint

Spin up a dedicated endpoint from a model template.
The model must have a corresponding template (check with the list request in step 1).
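
For a quick check, you can scan the templates listing from step 1 for the model you want. A minimal sketch, assuming the response is a flat list of template objects with a model_name field (inspect the real payload and adjust the field names if the list is wrapped in an envelope):

# Reuses BASE_URL and headers from step 1.
templates = requests.get(f"{BASE_URL}/v0/dedicated_endpoints/templates", headers=headers).json()

# Assumption: a flat list of template objects with a "model_name" field.
wanted = "openai/gpt-oss-120b"
available = any(t.get("model_name") == wanted for t in templates)
print(f"{wanted} template available: {available}")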

Request

POST /v0/dedicated_endpoints

Required fields

Field                 Type     Description
name                  string   Display name for your endpoint
model_name            string   Model template name (e.g. gpt-oss-120b)
flavor_name           string   Template flavor (e.g. base, fast)
gpu_type              string   GPU type supported by the template (e.g. gpu-h100-sxm)
gpu_count             integer  Number of GPUs per replica supported by the model template
scaling.min_replicas  integer  Minimum number of replicas
scaling.max_replicas  integer  Maximum number of replicas
Python:
import requests, os, json

BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ.get("API_TOKEN")

payload = {
    "name": "GPT-120B Endpoint",
    "description": "Dedicated GPT-120B for internal apps",
    "model_name": "openai/gpt-oss-120b",
    "flavor_name": "base",
    "gpu_type": "gpu-h100-sxm",
    "gpu_count": 1,
    "scaling": {
        "min_replicas": 1,
        "max_replicas": 2
    }
}

headers = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Content-Type": "application/json"
}

r = requests.post(f"{BASE_URL}/v0/dedicated_endpoints", headers=headers, json=payload)
print(json.dumps(r.json(), indent=2))

Response

Returns endpoint details including:
  • id: internal endpoint ID
  • routing_key: model identifier to use for inference
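
It is worth capturing both values immediately: the update and delete calls below take the endpoint ID, and the inference call takes the routing key. A minimal sketch, assuming r is the response from the create request above and that both fields appear at the top level of the JSON body:

endpoint = r.json()
endpoint_id = endpoint["id"]            # used in the PATCH and DELETE calls below
routing_key = endpoint["routing_key"]   # used as the model name for inference
print(endpoint_id, routing_key)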

3. List Your Endpoints

Request

GET /v0/dedicated_endpoints
Python:
# Reuses BASE_URL and headers from the examples above.
r = requests.get(f"{BASE_URL}/v0/dedicated_endpoints", headers=headers)
print(json.dumps(r.json(), indent=2))
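
If you no longer have the create response, this listing is also a way to recover an endpoint's ID. A rough sketch, assuming the response is a flat list of endpoint objects with id and name fields (the actual envelope may differ):

for ep in r.json():
    print(ep.get("id"), ep.get("name"))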

4. Run Inference on Your Endpoint

Once the endpoint is ready (initial startup may take several minutes), you can use it via the standard OpenAI-compatible API.
Python:
from openai import OpenAI
import os

API_TOKEN = os.environ.get("API_TOKEN")
client = OpenAI(
    base_url="https://api.tokenfactory.nebius.com/v1/",
    api_key=API_TOKEN
)

response = client.chat.completions.create(
    model="dedicated/openai/gpt-oss-120b-tSPQ1iBvCKj3",  # use routing_key from endpoint creation
    messages=[{"role": "user", "content": "Explain the difference between LLM fine-tuning and RAG."}]
)

print(response.choices[0].message.content)
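
Because inference calls can return 404 Not Found while the endpoint is still starting (see Notes below), it can help to retry until the first replica is live. A sketch using the openai 1.x SDK, which raises openai.NotFoundError on 404; routing_key is the value captured from the create response in step 2:

import time, openai

response = None
for attempt in range(30):
    try:
        response = client.chat.completions.create(
            model=routing_key,
            messages=[{"role": "user", "content": "Hello!"}],
        )
        break
    except openai.NotFoundError:
        time.sleep(30)  # endpoint not live yet; startup can take 10-15 minutes

if response is not None:
    print(response.choices[0].message.content)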

5. Update Endpoint Configuration

Modify scaling or metadata without recreating the endpoint.

Request

The PATCH request is partial — all parameters are optional.
Only the fields you include in the request body will be updated; any fields you omit will remain unchanged.
PATCH /v0/dedicated_endpoints/{endpoint_id}
Python:
# endpoint_id is the "id" returned when the endpoint was created (step 2).
payload = {
    "name": "GPT-120B Endpoint (updated)",
    "scaling": {"min_replicas": 2, "max_replicas": 4}
}

r = requests.patch(
    f"{BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers=headers,
    json=payload
)
print(json.dumps(r.json(), indent=2))

6. Delete an Endpoint

To decommission an endpoint:

Request

DELETE /v0/dedicated_endpoints/{endpoint_id}
Note: Deleting an endpoint is permanent.
The endpoint configuration and associated resources will be removed, and any reserved minimum replicas (GPU instances) will be released. This action cannot be undone.
Python:
# endpoint_id is the "id" returned when the endpoint was created (step 2).
r = requests.delete(f"{BASE_URL}/v0/dedicated_endpoints/{endpoint_id}", headers=headers)
print(r.status_code)

Notes

  • Endpoint creation can take up to 10–15 minutes. During startup, inference calls may return 404 Not Found.
  • Each endpoint is billed based on active GPU usage and scaling configuration. Start-up and shut-down periods are not counted toward billable minutes.
  • Once live, endpoints behave like any OpenAI-compatible model under /v1/chat/completions.