Overview
Dedicated endpoints let you:

- Launch an isolated instance of a model template (e.g. openai/gpt-oss-120b, deepseek-ai/DeepSeek-V3.1, etc.)
- Define scaling and GPU configuration
- Manage lifecycle: create, update, and delete endpoints
- Access them via the standard OpenAI-compatible API (/v1/chat/completions)
Authentication
All endpoints require a Bearer API token.

Base URL
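As a minimal sketch of the shared setup: the API token is read from an environment variable and sent as a Bearer credential on every request. The base URL below is a placeholder, not the real host; substitute your provider's actual base URL.

```python
import os

# Placeholder base URL -- substitute your provider's actual API host.
BASE_URL = "https://api.example.com"

# Read the API token from the environment rather than hard-coding it.
API_KEY = os.environ.get("API_KEY", "<your-api-token>")

# Every request in this guide sends the token as a Bearer credential.
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```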
1. List Available Templates
List all available model templates you can base your endpoint on.

Request
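Since the original code tab is not reproduced here, the following stdlib-only Python sketch shows the general shape of the call. Both the base URL and the /v0/model_templates path are hypothetical stand-ins (this guide does not name the list-templates route):

```python
import json
import os
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder; use your provider's host
TOKEN = os.environ.get("API_KEY", "<your-api-token>")

# Hypothetical path: the list-templates route is not named in this guide,
# so /v0/model_templates is an assumption.
templates_url = f"{BASE_URL}/v0/model_templates"

def list_templates():
    # GET the available model templates, authenticated with the Bearer token.
    req = urllib.request.Request(
        templates_url,
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```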
2. Create a Dedicated Endpoint
Spin up a dedicated endpoint from a model template. The model must have a corresponding template (check using the command above).
Request
Required fields
| Field | Type | Description |
|---|---|---|
| name | string | Display name for your endpoint |
| model_name | string | Model template name (e.g. gpt-oss-120b) |
| flavor_name | string | Template flavor (e.g. base, fast) |
| gpu_type | string | GPU type supported by the template (e.g. gpu-h100-sxm) |
| gpu_count | integer | Number of GPUs per replica supported by the model template |
| scaling.min_replicas | integer | Minimum number of replicas |
| scaling.max_replicas | integer | Maximum number of replicas |
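A hedged Python sketch of the create call, built from the required fields in the table above. The field values are illustrative only; the base URL is a placeholder, and the /v0/dedicated_endpoints path is inferred from the DELETE route shown in section 6.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host
TOKEN = os.environ.get("API_KEY", "<your-api-token>")

# Request body built from the required fields in the table above.
# All values here are illustrative.
payload = {
    "name": "my-gpt-oss-endpoint",
    "model_name": "gpt-oss-120b",
    "flavor_name": "base",
    "gpu_type": "gpu-h100-sxm",
    "gpu_count": 8,
    "scaling": {"min_replicas": 1, "max_replicas": 2},
}

def create_endpoint():
    # Path inferred from the DELETE route (/v0/dedicated_endpoints/{id}).
    req = urllib.request.Request(
        f"{BASE_URL}/v0/dedicated_endpoints",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # The response includes "id" and "routing_key" (see below).
        return json.load(resp)
```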
Response
Returns endpoint details including:

- id: internal endpoint ID
- routing_key: model identifier to use for inference
3. List Your Endpoints
Request
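A sketch of the list call under the same assumptions as above: the base URL is a placeholder and the /v0/dedicated_endpoints path is inferred from the DELETE route in section 6.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host
TOKEN = os.environ.get("API_KEY", "<your-api-token>")

def list_endpoints():
    # GET your dedicated endpoints; path inferred from the DELETE route.
    req = urllib.request.Request(
        f"{BASE_URL}/v0/dedicated_endpoints",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```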
4. Run Inference on Your Endpoint
Once the endpoint is ready (initial startup may take several minutes), you can use it via the standard OpenAI-compatible API.
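A sketch of an inference call against the OpenAI-compatible route. Substitute the routing_key returned when the endpoint was created as the model identifier; the base URL is a placeholder.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host
TOKEN = os.environ.get("API_KEY", "<your-api-token>")

# Use the routing_key from the create response as the model identifier.
chat_payload = {
    "model": "<routing_key-from-create-response>",
    "messages": [{"role": "user", "content": "Hello!"}],
}

def chat():
    # Standard OpenAI-compatible chat completions route.
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(chat_payload).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
        return data["choices"][0]["message"]["content"]
```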
5. Update Endpoint Configuration
Modify scaling or metadata without recreating the endpoint.

Request
The PATCH request is partial: all parameters are optional. Only the fields you include in the request body will be updated; any fields you omit will remain unchanged.
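A sketch of a partial update: only scaling is included in the body, so all other fields stay unchanged. The PATCH path mirrors the DELETE route in section 6 and is an assumption, as are the base URL and the scaling values.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host
TOKEN = os.environ.get("API_KEY", "<your-api-token>")

# Partial body: only scaling is sent, so every omitted field is untouched.
patch_body = {"scaling": {"min_replicas": 0, "max_replicas": 4}}

def update_endpoint(endpoint_id):
    # Path mirrors the documented DELETE route for a single endpoint.
    req = urllib.request.Request(
        f"{BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
        data=json.dumps(patch_body).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```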
6. Delete an Endpoint
To decommission an endpoint:

Request
DELETE /v0/dedicated_endpoints/{endpoint_id}
Note: Deleting an endpoint is permanent.
The endpoint configuration and associated resources will be removed, and any reserved minimum replicas (GPU instances) will be released. This action cannot be undone.
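The DELETE route above can be sketched in Python as follows (only the base URL is a placeholder; the path is the one documented in this section):

```python
import os
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host
TOKEN = os.environ.get("API_KEY", "<your-api-token>")

def delete_endpoint(endpoint_id):
    # Permanently removes the endpoint; this cannot be undone.
    req = urllib.request.Request(
        f"{BASE_URL}/v0/dedicated_endpoints/{endpoint_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        method="DELETE",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```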
Notes
- Endpoint creation can take up to 10–15 minutes. During startup, inference calls may return 404 Not Found.
- Each endpoint is billed based on active GPU usage and scaling configuration. Startup and shutdown periods are not counted toward billable minutes.
- Once live, endpoints behave like any OpenAI-compatible model under /v1/chat/completions.