Documentation Index
Fetch the complete documentation index at: https://docs.tokenfactory.nebius.com/llms.txt
Use this file to discover all available pages before exploring further.
Overview
Deploying a dedicated endpoint via API takes three steps:
- List available model templates
- Create a dedicated endpoint
- Send inference requests
For updating, listing, and deleting endpoints, see Operating dedicated endpoints.
List model templates
List model templates that can be used to create a dedicated endpoint.
GET /v0/dedicated_endpoints/templates
import os, json, requests
CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
# List the model templates available for dedicated endpoints
r = requests.get(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/templates",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
Use the template response as the source of truth for valid combinations of model_name, flavor_name, gpu_type, gpu_count and region.
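As a quick illustration, the sketch below prints those fields for each template. The exact response envelope is an assumption here: templates may arrive as a bare JSON array or wrapped under a data key, so adjust to the actual payload.
# Continues from the request above.
# Assumption: templates arrive as a bare list or under a "data" key.
templates = r.json()
if isinstance(templates, dict):
    templates = templates.get("data", [])
for t in templates:
    print(
        t.get("model_name"),
        t.get("flavor_name"),
        t.get("gpu_type"),
        t.get("gpu_count"),
        t.get("region"),
    )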
Create dedicated endpoint
Create a dedicated endpoint from one of the available model templates.
POST /v0/dedicated_endpoints
import os, json, requests
CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
# Pick model_name, flavor_name, gpu_type, gpu_count, and region from
# a valid combination in the template listing above
payload = {
    "name": "GPT-20B Endpoint",
    "description": "Dedicated GPT-20B for internal apps",
    "model_name": "openai/gpt-oss-20b",
    "flavor_name": "base",
    "gpu_type": "gpu-h100-sxm",
    "gpu_count": 1,
    "region": "eu-north1",
    "scaling": {"min_replicas": 1, "max_replicas": 2},
}
r = requests.post(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
The response includes endpoint metadata, including:
- endpoint_id: use it to manage the endpoint through the control plane.
- routing_key: use it as the model identifier when sending inference requests to the data plane.
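A minimal sketch of pulling both out of the creation response (assuming the fields appear at the top level of the returned JSON; adjust to the actual payload):
endpoint = r.json()
# Assumption: both fields sit at the top level of the creation response
endpoint_id = endpoint["endpoint_id"]   # for control-plane management calls
routing_key = endpoint["routing_key"]   # model identifier for inference requests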
Initial deployment can take several minutes. While the endpoint is provisioning, inference requests may fail (often with a 404) until it becomes routable.
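If you want to block until the endpoint is routable, one option is to poll the data plane itself. This is a minimal sketch, assuming a 404 reliably means the endpoint is still provisioning; it reuses the region-appropriate data-plane URL described in the next section.
import os, time, requests
API_TOKEN = os.environ["API_TOKEN"]
INFERENCE_BASE_URL = "https://api.tokenfactory.us-central1.nebius.com/v1"  # match the endpoint region
routing_key = "<routing_key>"  # from the creation response
# Assumption: a 404 means the endpoint is not yet routable
for _ in range(60):  # poll for up to ~10 minutes
    r = requests.post(
        f"{INFERENCE_BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "model": routing_key,
            "messages": [{"role": "user", "content": "ping"}],
        },
    )
    if r.status_code != 404:
        r.raise_for_status()  # surface any non-404 error
        break
    time.sleep(10)
else:
    raise TimeoutError("endpoint did not become routable in time")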
Send inference requests
Once the endpoint is ready, send requests to the OpenAI-compatible data plane under /v1.
Use the routing_key returned by the control plane as the model value in inference requests.
Use a region-appropriate base URL for your inference calls. Example request URL:
https://api.tokenfactory.us-central1.nebius.com/v1/chat/completions
import os
from openai import OpenAI
API_TOKEN = os.environ["API_TOKEN"]
# Choose based on the endpoint region
INFERENCE_BASE_URL = "https://api.tokenfactory.us-central1.nebius.com/v1"
client = OpenAI(
    base_url=INFERENCE_BASE_URL,
    api_key=API_TOKEN,
)
routing_key = "<routing_key>"  # use exactly what the API returned
response = client.chat.completions.create(
    model=routing_key,
    messages=[{"role": "user", "content": "Explain the difference between LLM fine-tuning and RAG."}],
)
print(response.choices[0].message.content)
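Because the data plane is OpenAI-compatible, standard client features work as usual. For example, streaming the response (assuming the deployed model supports streaming, as chat models typically do):
# Continues from the client above; print tokens as they are generated
stream = client.chat.completions.create(
    model=routing_key,
    messages=[{"role": "user", "content": "Summarize RAG in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()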
The data plane exposes OpenAI-compatible inference routes under /v1. Template availability determines which kinds of models you can deploy; today, the publicly available templates are primarily chat-capable.