Overview
Dedicated endpoints let you run an isolated deployment of a supported model with control over:

- Region: determines data residency and latency based on where the deployment runs
- GPU configuration: defines the GPU type and number of GPUs allocated per replica
- Autoscaling: sets minimum and maximum replicas to handle traffic changes automatically
- Lifecycle management: supports creating, updating, and deleting deployments

The service is split into two layers:

- Control plane to manage endpoints
- Data plane to run inference via an OpenAI-compatible inference API
Key Concepts
| Term | Description |
|---|---|
| Template | A deployable performance “blueprint” for a model. Templates define which flavor_name, gpu_type, and regions are supported. |
| Flavor | A template’s sub-option (e.g. base, fast) with different performance, throughput, and cost characteristics. |
| Endpoint | A dedicated deployment with API access. |
| endpoint_id | Identifier used for update/delete operations. |
| routing_key | The model identifier you pass to inference calls. Returned when you create an endpoint. |
| Control plane | Manages the configuration and settings of your endpoints. Uses a common base URL. |
| Data plane | Processes model inference requests. Uses a regional base URL. |
Control plane
The Dedicated Endpoints Control Plane is the management layer for all configuration operations:

- Creating & updating endpoints
- Uploading models / weights
- Scaling configs (min/max replicas)
- Monitoring setup
Data plane
The Dedicated Endpoints Data Plane processes model inference requests and has a regional base URL. Region impacts latency, data locality, and regulatory compliance. Use a region-appropriate base URL for your inference calls:

| Endpoint region | Inference base URL |
|---|---|
| eu-north1 | https://api.tokenfactory.nebius.com |
| eu-west1 | https://api.tokenfactory.eu-west1.nebius.com |
| us-central1 | https://api.tokenfactory.us-central1.nebius.com |
Using the respective inference base URL avoids unnecessary global routing and reduces round-trip latency.
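For example, a client can derive the inference base URL from the endpoint's region. This is a minimal sketch; the mapping simply mirrors the table above:

```python
# Inference base URLs per endpoint region (from the table above).
INFERENCE_BASE_URLS = {
    "eu-north1": "https://api.tokenfactory.nebius.com",
    "eu-west1": "https://api.tokenfactory.eu-west1.nebius.com",
    "us-central1": "https://api.tokenfactory.us-central1.nebius.com",
}

def inference_base_url(region: str) -> str:
    """Return the region-appropriate base URL to avoid global routing."""
    return INFERENCE_BASE_URLS[region]
```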
Launching & operating a dedicated endpoint
List available model templates
List templates that can back a dedicated endpoint.
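A minimal sketch with Python and requests. The control-plane base URL, the /v1/templates path, and the NEBIUS_API_KEY environment variable are assumptions for illustration; confirm the exact route in the API reference:

```python
import os

import requests

# Assumption: control-plane base URL and route are illustrative placeholders.
CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

resp = requests.get(f"{CONTROL_PLANE_URL}/v1/templates", headers=headers)
resp.raise_for_status()
# Each template lists the supported flavor_name, gpu_type, and regions.
print(resp.json())
```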
Create a dedicated endpoint

Capacity and scaling
- Deployment depends on available GPU capacity. Some instance types may be temporarily unavailable.
- Minimum replicas are reserved and always available while the endpoint is running.
- Capacity used for scaling above the minimum is opportunistic. It is not preempted during active use, but may be reclaimed once the endpoint scales down.
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Display name for the endpoint |
| description | string | no | Optional description |
| model_name | string | yes | Template model name (e.g. openai/gpt-oss-120b) |
| flavor_name | string | yes | Template flavor (e.g. base, fast) |
| gpu_type | string | yes | GPU type supported by the chosen template + flavor |
| gpu_count | integer | yes | GPUs per replica. Total maximum GPUs = gpu_count × scaling.max_replicas |
| region | string | yes | eu-north1, eu-west1, or us-central1 |
| scaling.min_replicas | integer | yes | Minimum replicas |
| scaling.max_replicas | integer | yes | Maximum replicas |
The create response returns an endpoint_id (used for update/delete operations) and a routing_key (the model identifier for inference calls).
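As an illustration, a create request could look like the sketch below. The /v1/endpoints path, the example values, and the response shape are assumptions; the field names come from the table above:

```python
import os

import requests

CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"  # assumption: confirm the control-plane base URL
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

payload = {
    "name": "my-dedicated-endpoint",
    "description": "Dedicated endpoint for production traffic",
    "model_name": "openai/gpt-oss-120b",
    "flavor_name": "base",
    "gpu_type": "<gpu_type>",  # must be supported by the chosen template + flavor
    "gpu_count": 1,            # GPUs per replica
    "region": "eu-north1",
    "scaling": {"min_replicas": 1, "max_replicas": 2},
}

resp = requests.post(f"{CONTROL_PLANE_URL}/v1/endpoints", headers=headers, json=payload)
resp.raise_for_status()
endpoint = resp.json()
# Keep endpoint_id for updates/deletes and routing_key for inference calls.
print(endpoint["endpoint_id"], endpoint["routing_key"])
```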
Initial bring-up can take several minutes. While provisioning, inference may fail (often with a 404) until the endpoint is routable.

Send inference requests
Once your endpoint is ready, send requests to the OpenAI-compatible data plane. Use the routing_key returned by the control plane as your model identifier:
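For example, with the OpenAI Python client (a sketch assuming a eu-north1 endpoint and a NEBIUS_API_KEY environment variable; substitute the routing_key returned at creation):

```python
import os

from openai import OpenAI

# Use the inference base URL that matches the endpoint's region (see the table above).
client = OpenAI(
    base_url="https://api.tokenfactory.nebius.com/v1",
    api_key=os.environ["NEBIUS_API_KEY"],
)

completion = client.chat.completions.create(
    model="<routing_key>",  # the routing_key returned by the control plane
    messages=[{"role": "user", "content": "Hello from my dedicated endpoint!"}],
)
print(completion.choices[0].message.content)
```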
We expose OpenAI-compatible inference routes under /v1. Template availability determines what kinds of models you can deploy (today, publicly available templates are primarily chat-capable).

Update Endpoint Configuration
PATCH updates only the provided fields.
| Field | Type | Description |
|---|---|---|
| name | string | Display name |
| description | string | Optional description |
| enabled | boolean | Controls whether the dedicated endpoint is active. Only enabled endpoints are charged. |
| gpu_type | string | Type of GPU |
| gpu_count | integer | Number of GPUs per replica |
| scaling.min_replicas | integer | Minimum replicas |
| scaling.max_replicas | integer | Maximum replicas |
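A sketch of a partial update, again assuming an illustrative /v1/endpoints/{endpoint_id} route; only the fields present in the body change:

```python
import os

import requests

CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"  # assumption
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}
endpoint_id = "<endpoint_id>"  # returned when the endpoint was created

# PATCH semantics: fields omitted from the body keep their current values.
resp = requests.patch(
    f"{CONTROL_PLANE_URL}/v1/endpoints/{endpoint_id}",
    headers=headers,
    json={"scaling": {"min_replicas": 1, "max_replicas": 4}},
)
resp.raise_for_status()
print(resp.json())
```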
Endpoint’s region config

region cannot be updated after the dedicated endpoint is created.

List dedicated endpoints
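A sketch, under the same assumed route and placeholder credentials as above:

```python
import os

import requests

CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"  # assumption
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}

resp = requests.get(f"{CONTROL_PLANE_URL}/v1/endpoints", headers=headers)
resp.raise_for_status()
print(resp.json())  # each entry should include its endpoint_id and routing_key
```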
Delete endpoint
Deletes an endpoint permanently.
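A sketch of a delete call (route and placeholders assumed, as above); deletion is irreversible:

```python
import os

import requests

CONTROL_PLANE_URL = "https://api.tokenfactory.nebius.com"  # assumption
headers = {"Authorization": f"Bearer {os.environ['NEBIUS_API_KEY']}"}
endpoint_id = "<endpoint_id>"

# Permanent: the endpoint cannot be recovered after deletion.
resp = requests.delete(f"{CONTROL_PLANE_URL}/v1/endpoints/{endpoint_id}", headers=headers)
resp.raise_for_status()
```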
Observability

See the Observability documentation section.
Troubleshooting
| Status | Typical cause | Fix |
|---|---|---|
| 401 Unauthorized | Missing/invalid token | Ensure Authorization: Bearer ... is set and the token is active |
| 403 Forbidden | Token lacks permission | Use a token with dedicated endpoint permissions |
| 404 Not Found | Endpoint not ready, wrong inference domain, or wrong routing_key | Wait for readiness; use the correct inference base URL for the endpoint region; pass routing_key exactly as returned |
| 409 Conflict | Capacity or config conflict | Choose a supported GPU/region combo from templates or reduce max replicas |
| 422 Unprocessable Entity | Invalid payload values | Validate fields against templates (GPU types, flavors, regions, counts) |
Billing and Accessibility Policy for Dedicated Endpoints
- A dedicated endpoint is considered active, accessible, and billable when at least one replica is running: the endpoint can serve traffic and billing charges apply
- When zero replicas are running, the endpoint is not accessible and billing charges do not apply
- Billing is driven by the GPU capacity implied by your scaling settings. This may vary depending on your custom contract or work order.