Documentation Index
Fetch the complete documentation index at: https://docs.tokenfactory.nebius.com/llms.txt
Use this file to discover all available pages before exploring further.
Overview
Deploying a dedicated endpoint via API takes three steps:
- List available model templates
- Create a dedicated endpoint
- Send inference requests
For updating, listing, and deleting endpoints, see Operating dedicated endpoints.
List model templates
List model templates that can be used to create a dedicated endpoint.
GET /v0/dedicated_endpoints/templates
import os, json, requests
CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
# List the model templates available for dedicated endpoints
r = requests.get(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/templates",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
Use the template response as the source of truth for valid combinations of model_name, flavor_name, gpu_type, gpu_count and region.
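As a quick illustration, the sketch below prints those fields for each template. The exact response envelope is an assumption here: templates may arrive as a bare JSON array or wrapped under a data key, so adjust to the actual payload.
# Continues from the request above.
# Assumption: templates arrive as a bare list or under a "data" key.
templates = r.json()
if isinstance(templates, dict):
    templates = templates.get("data", [])
for t in templates:
    print(
        t.get("model_name"),
        t.get("flavor_name"),
        t.get("gpu_type"),
        t.get("gpu_count"),
        t.get("region"),
    )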
Create dedicated endpoint
Create a dedicated endpoint from one of the available model templates.
POST /v0/dedicated_endpoints
import os, json, requests
CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
# Pick model_name, flavor_name, gpu_type, gpu_count, and region from
# a valid combination in the template listing above
payload = {
    "name": "GPT-20B Endpoint",
    "description": "Dedicated GPT-20B for internal apps",
    "model_name": "openai/gpt-oss-20b",
    "flavor_name": "base",
    "gpu_type": "gpu-h100-sxm",
    "gpu_count": 1,
    "region": "eu-north1",
    "scaling": {"min_replicas": 1, "max_replicas": 2},
}
r = requests.post(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
The response includes endpoint metadata, including:
- endpoint_id: use it to manage the endpoint through the control plane.
- routing_key: use it as the model identifier when sending inference requests to the data plane.
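A minimal sketch of pulling both out of the creation response (assuming the fields appear at the top level of the returned JSON; adjust to the actual payload):
endpoint = r.json()
# Assumption: both fields sit at the top level of the creation response
endpoint_id = endpoint["endpoint_id"]   # for control-plane management calls
routing_key = endpoint["routing_key"]   # model identifier for inference requests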
Initial deployment can take several minutes. While the endpoint is provisioning, inference requests may fail (often with a 404) until it becomes routable.
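If you want to block until the endpoint is routable, one option is to poll the data plane itself. This is a minimal sketch, assuming a 404 reliably means the endpoint is still provisioning; it reuses the region-appropriate data-plane URL described in the next section.
import os, time, requests
API_TOKEN = os.environ["API_TOKEN"]
INFERENCE_BASE_URL = "https://api.tokenfactory.us-central1.nebius.com/v1"  # match the endpoint region
routing_key = "<routing_key>"  # from the creation response
# Assumption: a 404 means the endpoint is not yet routable
for _ in range(60):  # poll for up to ~10 minutes
    r = requests.post(
        f"{INFERENCE_BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "model": routing_key,
            "messages": [{"role": "user", "content": "ping"}],
        },
    )
    if r.status_code != 404:
        r.raise_for_status()  # surface any non-404 error
        break
    time.sleep(10)
else:
    raise TimeoutError("endpoint did not become routable in time")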
Send inference requests
Once the endpoint is ready, send requests to the OpenAI-compatible data plane under /v1.
Use the routing_key returned by the control plane as the model value in inference requests.
Use a region-appropriate base URL for your inference calls. Example request URL:
https://api.tokenfactory.us-central1.nebius.com/v1/chat/completions
import os
from openai import OpenAI
API_TOKEN = os.environ["API_TOKEN"]
# Choose based on the endpoint region
INFERENCE_BASE_URL = "https://api.tokenfactory.us-central1.nebius.com/v1"
client = OpenAI(
    base_url=INFERENCE_BASE_URL,
    api_key=API_TOKEN,
)
routing_key = "<routing_key>"  # use exactly what the API returned
response = client.chat.completions.create(
    model=routing_key,
    messages=[{"role": "user", "content": "Explain the difference between LLM fine-tuning and RAG."}],
)
print(response.choices[0].message.content)
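Because the data plane is OpenAI-compatible, standard client features work as usual. For example, streaming the response (assuming the deployed model supports streaming, as chat models typically do):
# Continues from the client above; print tokens as they are generated
stream = client.chat.completions.create(
    model=routing_key,
    messages=[{"role": "user", "content": "Summarize RAG in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()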
The data plane exposes OpenAI-compatible inference routes under /v1. Template availability determines which kinds of models you can deploy; today, the publicly available templates are primarily chat-capable.