Parameter descriptions

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Display name for the endpoint |
| description | string | no | Optional description |
| model_name | string | yes | Template model name (e.g. openai/gpt-oss-120b) |
| flavor_name | string | yes | Template flavor (e.g. base, fast) |
| gpu_type | string | yes | GPU type supported by the chosen template + flavor |
| gpu_count | integer | yes | Number of GPUs per replica. Total maximum GPUs = gpu_count × scaling.max_replicas |
| region | string | yes | Region: eu-north1, eu-west1, or us-central1 |
| scaling.min_replicas | integer | yes | Minimum replicas |
| scaling.max_replicas | integer | yes | Maximum replicas |
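
The GPU ceiling described for gpu_count above can be checked locally before a request is sent. The sketch below is illustrative only: `max_total_gpus` is a hypothetical helper, not part of the API. It assumes the payload nests min_replicas and max_replicas under a `scaling` object, as the PATCH example on this page does. The min/max comparison is a local sanity check, not documented server-side validation, and the gpu_type value is a placeholder.

```python
def max_total_gpus(payload: dict) -> int:
    """Return the GPU ceiling: gpu_count x scaling.max_replicas."""
    scaling = payload["scaling"]
    # Local sanity check (an assumption, not documented API behavior).
    if scaling["min_replicas"] > scaling["max_replicas"]:
        raise ValueError("scaling.min_replicas must not exceed scaling.max_replicas")
    return payload["gpu_count"] * scaling["max_replicas"]


# Illustrative values only; gpu_type is a placeholder, not a confirmed option.
example_payload = {
    "name": "my-endpoint",
    "model_name": "openai/gpt-oss-120b",
    "flavor_name": "base",
    "gpu_type": "<gpu_type>",
    "gpu_count": 2,
    "region": "eu-north1",
    "scaling": {"min_replicas": 1, "max_replicas": 4},
}

print(max_total_gpus(example_payload))  # 2 GPUs x 4 replicas = 8
```

Running the check before creation makes the worst-case GPU footprint explicit, which is the figure to compare against any quota or budget.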

Update Endpoint Configuration

PATCH updates only the provided fields.
PATCH /v0/dedicated_endpoints/{endpoint_id}
The endpoint's region cannot be updated after the dedicated endpoint is created.
import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
endpoint_id = "<endpoint_id>"

payload = {
    "name": "GPT-20B Endpoint (updated)",
    "scaling": {"min_replicas": 2, "max_replicas": 4},
}

r = requests.patch(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))
Scaling changes may trigger provisioning of additional replicas. Plan for a short warm-up period while new replicas come online.
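
Because PATCH applies only the fields that are sent, and region is fixed at creation, it can help to build the request body through a small guard. The following is a minimal sketch under those two stated rules: `build_patch_payload` and `IMMUTABLE_FIELDS` are hypothetical names introduced here, and treating `None` as "leave unchanged" is a client-side convention of this sketch, not API behavior.

```python
IMMUTABLE_FIELDS = {"region"}  # region cannot be changed after creation


def build_patch_payload(**changes) -> dict:
    """Keep only the fields the caller actually wants to change."""
    blocked = IMMUTABLE_FIELDS.intersection(changes)
    if blocked:
        raise ValueError(f"cannot be updated after creation: {sorted(blocked)}")
    # Drop None values so untouched fields are omitted from the PATCH body.
    return {field: value for field, value in changes.items() if value is not None}


payload = build_patch_payload(
    name="GPT-20B Endpoint (updated)",
    description=None,  # omitted from the body, so it stays as it is
    scaling={"min_replicas": 2, "max_replicas": 4},
)
print(payload)
```

The resulting dictionary can be passed as `json=payload` to the `requests.patch` call shown above.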

List dedicated endpoints

GET /v0/dedicated_endpoints
import os, json, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]

r = requests.get(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))

Delete endpoint

Deletes an endpoint permanently.
DELETE /v0/dedicated_endpoints/{endpoint_id}
Deletion cannot be undone. The GPUs reserved for min_replicas are released and become available to other users.
import os, requests

CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
API_TOKEN = os.environ["API_TOKEN"]
endpoint_id = "<endpoint_id>"

r = requests.delete(
    f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
print(r.status_code)
r.raise_for_status()
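
Since deletion is permanent, a script may want an explicit confirmation step before the DELETE request goes out. The sketch below is one possible guard, not part of the API: `delete_endpoint` and its `confirm_id` parameter are hypothetical, and `session` is assumed to be a `requests.Session` (or any object with a compatible `delete` method) that already carries the Authorization header.

```python
def delete_endpoint(session, base_url: str, endpoint_id: str, *, confirm_id: str) -> int:
    """Delete an endpoint only when the caller repeats its ID as confirmation."""
    if confirm_id != endpoint_id:
        # Fail before any network call is made.
        raise ValueError("confirm_id does not match endpoint_id; nothing was deleted")
    response = session.delete(f"{base_url}/v0/dedicated_endpoints/{endpoint_id}")
    response.raise_for_status()
    return response.status_code


# Example usage (left commented out because it performs a permanent deletion):
# import os, requests
# session = requests.Session()
# session.headers["Authorization"] = f"Bearer {os.environ['API_TOKEN']}"
# delete_endpoint(
#     session,
#     "https://api.tokenfactory.nebius.com",
#     "<endpoint_id>",
#     confirm_id="<endpoint_id>",
# )
```

Passing the session in, rather than calling `requests.delete` directly, keeps the guard easy to exercise with a stand-in object in tests.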