# Quickstart: Deploy via API

## Overview

Deploying a dedicated endpoint via API takes three steps:

1. List available model templates
2. Create a dedicated endpoint
3. Send inference requests

For updating, listing, and deleting endpoints, see **Operating dedicated endpoints**.

## List model templates

List model templates that can be used to create a dedicated endpoint.

```http theme={null}
GET /v0/dedicated_endpoints/templates
```

<CodeGroup>
  ```python Python theme={null}
  import json
  import os

  import requests

  CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
  API_TOKEN = os.environ["API_TOKEN"]

  r = requests.get(
      f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints/templates",
      headers={"Authorization": f"Bearer {API_TOKEN}"},
      timeout=30,  # requests has no default timeout; 30 s is an arbitrary choice
  )
  r.raise_for_status()
  print(json.dumps(r.json(), indent=2))
  ```

  ```shellscript cURL theme={null}
  curl -sS \
    -H "Authorization: Bearer $API_TOKEN" \
    "https://api.tokenfactory.nebius.com/v0/dedicated_endpoints/templates"
  ```
</CodeGroup>

<Tip>
  Treat the template response as the source of truth for valid combinations of `model_name`, `flavor_name`, `gpu_type`, `gpu_count`, and `region`.
</Tip>
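
The check described in the tip can be sketched as a small filter over the template list. The response schema is not shown on this page, so this sketch assumes each template is a flat JSON object carrying the five fields as top-level keys. `matching_templates` and the sample data are hypothetical; adapt them to the structure the API actually returns.

```python
# Assumption: each template is a flat dict with these five top-level keys.
# The real response may nest or name them differently.
COMBINATION_FIELDS = ("model_name", "flavor_name", "gpu_type", "gpu_count", "region")


def matching_templates(templates, payload):
    """Return the templates whose combination fields all equal the payload's."""
    return [
        t for t in templates
        if all(f in t and t[f] == payload.get(f) for f in COMBINATION_FIELDS)
    ]


# Hypothetical sample data shaped like the assumption above.
sample_templates = [
    {
        "model_name": "openai/gpt-oss-20b",
        "flavor_name": "base",
        "gpu_type": "gpu-h100-sxm",
        "gpu_count": 1,
        "region": "eu-north1",
    },
]
sample_payload = dict(sample_templates[0], name="GPT-20B Endpoint")

if not matching_templates(sample_templates, sample_payload):
    raise ValueError("payload does not match any available template")
```

Running a check like this before the create call surfaces an invalid combination locally instead of as an API error.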

## Create dedicated endpoint

Create a dedicated endpoint from one of the available model templates.

```http theme={null}
POST /v0/dedicated_endpoints
```

<CodeGroup>
  ```python Python theme={null}
  import json
  import os

  import requests

  CONTROL_PLANE_BASE_URL = "https://api.tokenfactory.nebius.com"
  API_TOKEN = os.environ["API_TOKEN"]

  payload = {
      "name": "GPT-20B Endpoint",
      "description": "Dedicated GPT-20B for internal apps",
      "model_name": "openai/gpt-oss-20b",
      "flavor_name": "base",
      "gpu_type": "gpu-h100-sxm",
      "gpu_count": 1,
      "region": "eu-north1",
      "scaling": {"min_replicas": 1, "max_replicas": 2},
  }

  r = requests.post(
      f"{CONTROL_PLANE_BASE_URL}/v0/dedicated_endpoints",
      headers={"Authorization": f"Bearer {API_TOKEN}"},
      json=payload,
      timeout=30,  # requests has no default timeout; 30 s is an arbitrary choice
  )
  r.raise_for_status()
  print(json.dumps(r.json(), indent=2))
  ```

  ```shellscript cURL theme={null}
  curl -sS -X POST \
    "https://api.tokenfactory.nebius.com/v0/dedicated_endpoints" \
    -H "Authorization: Bearer $API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
      "name": "GPT-20B Endpoint",
      "description": "Dedicated GPT-20B for internal apps",
      "model_name": "openai/gpt-oss-20b",
      "flavor_name": "base",
      "gpu_type": "gpu-h100-sxm",
      "gpu_count": 1,
      "region": "eu-north1",
      "scaling": { "min_replicas": 1, "max_replicas": 2 }
    }'
  ```
</CodeGroup>

The response contains endpoint metadata, including:

* `endpoint_id`
* `routing_key`

Use `endpoint_id` to manage the endpoint through the control plane. Use `routing_key` as the model identifier when sending inference requests to the data plane.
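
As a rough sketch, the two identifiers could be pulled out of the parsed create response as follows. This assumes both fields are top-level keys of the JSON body, which this page does not confirm. `extract_endpoint_ids` is a hypothetical helper and the sample values are placeholders.

```python
def extract_endpoint_ids(create_response):
    """Return (endpoint_id, routing_key) from a parsed create response.

    Assumption: both fields are top-level keys of the JSON body.
    """
    required = ("endpoint_id", "routing_key")
    missing = [key for key in required if key not in create_response]
    if missing:
        raise KeyError(f"create response is missing: {', '.join(missing)}")
    return create_response["endpoint_id"], create_response["routing_key"]


# Placeholder values for illustration only.
endpoint_id, routing_key = extract_endpoint_ids(
    {"endpoint_id": "<endpoint_id>", "routing_key": "<routing_key>"}
)
```

Keeping the two values in separate variables makes it harder to pass the control-plane identifier to the data plane by mistake.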

<Note>
  Initial deployment can take several minutes. Until the endpoint is provisioned and routable, inference requests may fail, often with a `404` response.
</Note>
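
One way to handle the provisioning window is to retry a probe request until it stops returning `404`. The following is a minimal sketch, not a documented readiness API: `wait_until_routable` is a hypothetical helper, the timeout and interval defaults are arbitrary, and it treats only `404` as "not yet routable" even though other failures are possible during provisioning.

```python
import time


def wait_until_routable(probe, timeout_s=900.0, interval_s=15.0, sleep=time.sleep):
    """Call probe() until it stops returning 404 or the timeout elapses.

    probe is any zero-argument callable that sends one inference request
    and returns its HTTP status code. Returns the first non-404 status.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = probe()
        if status != 404:
            return status
        # Stop if another wait would run past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError("endpoint was still returning 404 at the deadline")
        sleep(interval_s)
```

In practice, `probe` would wrap a small chat completion request and return the response's status code. The caller should still inspect the returned status, because a non-`404` result is not necessarily a success.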

## Send inference requests

Once the endpoint is ready, send requests to the OpenAI-compatible data plane under `/v1`.

Use the `routing_key` returned by the control plane as the `model` value in inference requests:

```text theme={null}
model = "<routing_key>"
```

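Use the base URL that matches your endpoint's region. The examples below use the `us-central1` base URL; an endpoint created in another region, such as `eu-north1` in the create example above, needs that region's base URL instead. Example request URL: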

```text theme={null}
https://api.tokenfactory.us-central1.nebius.com/v1/chat/completions
```

<CodeGroup>
  ```python Python (OpenAI SDK) theme={null}
  import os
  from openai import OpenAI

  API_TOKEN = os.environ["API_TOKEN"]

  # Choose based on the endpoint region
  INFERENCE_BASE_URL = "https://api.tokenfactory.us-central1.nebius.com/v1"

  client = OpenAI(
      base_url=INFERENCE_BASE_URL,
      api_key=API_TOKEN,
  )

  routing_key = "<routing_key>"  # use exactly what the API returned
  response = client.chat.completions.create(
      model=routing_key,
      messages=[{"role": "user", "content": "Explain the difference between LLM fine-tuning and RAG."}],
  )

  print(response.choices[0].message.content)
  ```

  ```shellscript cURL theme={null}
  curl -sS "https://api.tokenfactory.us-central1.nebius.com/v1/chat/completions" \
    -H "Authorization: Bearer $API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "<routing_key>",
      "messages": [
        { "role": "user", "content": "Hello from my dedicated endpoint!" }
      ]
    }'
  ```
</CodeGroup>

<Note>
  Inference routes are OpenAI-compatible and served under `/v1`. The models you can deploy depend on which templates are available; at present, the publicly available templates are primarily chat-capable.
</Note>
