Common strategy for setting up a deployment
- GPU type + GPU count = performance per replica
- Replicas = traffic capacity and availability
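The split above can be pictured as a small configuration object. The sketch below is illustrative only: `gpu_count`, `min_replicas`, and `max_replicas` are the parameter names used on this page, while the `EndpointConfig` class, the `gpu_type` field name, the example GPU name, and the validation rules (including allowing zero minimum replicas) are assumptions, not the actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointConfig:
    """Hypothetical container for the settings discussed on this page."""

    gpu_type: str      # assumed field name; sets the per-replica performance profile
    gpu_count: int     # GPUs powering a single replica
    min_replicas: int  # baseline capacity kept allocated while the endpoint is active
    max_replicas: int  # ceiling that autoscaling may reach if capacity is available

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real service may enforce different limits.
        if self.gpu_count < 1:
            raise ValueError("gpu_count must be at least 1")
        if self.min_replicas < 0:
            raise ValueError("min_replicas must not be negative")
        if self.max_replicas < self.min_replicas:
            raise ValueError("max_replicas must be >= min_replicas")

    @property
    def min_gpus(self) -> int:
        """Total GPUs held at baseline (performance per replica x baseline replicas)."""
        return self.gpu_count * self.min_replicas

    @property
    def max_gpus(self) -> int:
        """Total GPUs in use if autoscaling reaches its ceiling."""
        return self.gpu_count * self.max_replicas


config = EndpointConfig(gpu_type="example-gpu", gpu_count=2, min_replicas=1, max_replicas=4)
print(config.min_gpus, config.max_gpus)  # prints: 2 8
```

Keeping per-replica settings (`gpu_type`, `gpu_count`) separate from replica counts makes it easier to reason about the two questions independently: how fast one replica is, and how many replicas the traffic needs.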
Choosing a performance setup: GPU type and GPUs per replica
Choosing the right dedicated endpoint configuration depends on your workload’s priorities: latency, throughput, model size, and cost efficiency.
GPU type
GPU type determines the performance profile of each replica, including memory capacity, throughput, and cost. In general:
| Higher-end GPUs | Mid-range GPUs |
|---|---|
| | |
GPUs per replica (gpu_count)
gpu_count defines how many GPUs power a single replica.
This primarily affects:
- Per-request latency
- Maximum throughput per replica
- Ability to serve larger or more demanding workloads
| Lower GPU count per replica | Higher GPU count per replica |
|---|---|
| | |
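One way to reason about the "larger workloads" point is a rough memory check: a replica needs enough total GPU memory to hold the model weights. The sketch below is a back-of-the-envelope estimate, not a sizing tool. The function name, the `usable_fraction` default, and the example numbers are assumptions; it ignores KV cache, activations, and any limits on which `gpu_count` values are actually offered.

```python
import math


def min_gpus_for_weights(
    params_billions: float,
    bytes_per_param: float,
    gpu_memory_gb: float,
    usable_fraction: float = 0.8,
) -> int:
    """Estimate the smallest gpu_count whose combined memory fits the model weights.

    usable_fraction is an assumed safety margin that leaves room for
    runtime overhead; tune it for your own workload.
    """
    # 1e9 parameters x bytes per parameter = that many GB (using 1 GB = 1e9 bytes).
    weights_gb = params_billions * bytes_per_param
    usable_gb_per_gpu = gpu_memory_gb * usable_fraction
    return max(1, math.ceil(weights_gb / usable_gb_per_gpu))


# Hypothetical example: a 70B-parameter model in 16-bit precision on 80 GB GPUs.
print(min_gpus_for_weights(params_billions=70, bytes_per_param=2, gpu_memory_gb=80))  # prints: 3
```

The result is only a lower bound for fitting the weights. A higher `gpu_count` than this minimum can still be worthwhile for lower per-request latency or higher throughput per replica.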
Choosing number of replicas
Replicas determine how much baseline and burst traffic your deployment can handle.
Min replicas: guaranteed baseline capacity
Minimum replicas are always allocated while the endpoint is active.
Use higher min_replicas when:
- You need predictable low latency
- Traffic is steady
- Cold starts are unacceptable
- Capacity predictability matters
Use lower min_replicas when:
- Traffic is irregular
- You optimize for cost
- You run batch or internal workloads
- Occasional warm-up is acceptable
Max replicas: burst scaling ceiling
Maximum replicas define how far autoscaling can expand if capacity is available.
Use higher max_replicas when:
- Traffic spikes significantly
- Usage is unpredictable
- You need burst headroom
Use lower max_replicas when:
- Workload is stable
- Budget control matters
- Capacity is predictable
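The guidance for `min_replicas` and `max_replicas` can be turned into simple arithmetic once you have measured how much traffic one replica sustains. The sketch below assumes you know your baseline and peak request rates and a per-replica throughput from your own load tests. The function name, its parameters, and the 20% headroom default are hypothetical and not part of the product.

```python
from __future__ import annotations

import math


def suggest_replica_bounds(
    baseline_rps: float,
    peak_rps: float,
    per_replica_rps: float,
    headroom: float = 0.2,
) -> tuple[int, int]:
    """Return (min_replicas, max_replicas) suggested by measured traffic.

    min_replicas covers steady baseline traffic without cold starts;
    max_replicas covers the peak plus an assumed safety margin.
    """
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    if peak_rps < baseline_rps:
        raise ValueError("peak_rps must be >= baseline_rps")

    min_replicas = math.ceil(baseline_rps / per_replica_rps)
    max_replicas = math.ceil(peak_rps * (1 + headroom) / per_replica_rps)
    # Never suggest a ceiling below the baseline.
    return min_replicas, max(max_replicas, min_replicas)


# Hypothetical numbers: 12 req/s baseline, 40 req/s peak, 5 req/s per replica.
print(suggest_replica_bounds(baseline_rps=12, peak_rps=40, per_replica_rps=5))  # prints: (3, 10)
```

Lowering the first value trades cold-start risk for cost, and lowering the second trades burst headroom for budget control, which mirrors the lists above.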
Note that:
- Min replicas are guaranteed while the endpoint is active. If you stop the endpoint, the reservation is released.
- Scaling toward max replicas depends on available capacity, and capacity released during scale-down may not be available again later. Read more in the Capacity section.