

Common strategy for setting up a deployment

  • GPU type + GPU count = performance per replica
  • Replicas = traffic capacity and availability
Start with the smallest setup that reliably meets baseline needs, then scale deliberately based on real traffic and observability:
1. Single-replica setup: choose a GPU type and gpu_count that meet your single-replica performance goals
2. Autoscaling setup: use replicas to absorb traffic spikes and concurrency
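Putting the two steps together, a dedicated endpoint configuration might look like the sketch below. The field names and model ID here are illustrative assumptions, not the exact Token Factory API schema:

```python
# Hypothetical endpoint configuration -- field names are illustrative,
# not the exact Token Factory API schema.
endpoint_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # example model id
    # Step 1: per-replica performance (GPU type + count)
    "gpu_type": "h100",   # smallest type that meets latency goals
    "gpu_count": 1,       # GPUs powering a single replica
    # Step 2: autoscaling bounds for traffic capacity
    "min_replicas": 1,    # guaranteed baseline capacity
    "max_replicas": 4,    # burst ceiling for traffic spikes
}

# Sanity check: the autoscaling range must be well-formed.
assert endpoint_config["min_replicas"] <= endpoint_config["max_replicas"]
```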


Choosing a performance setup: GPU type & GPUs per replica

Choosing the right dedicated endpoint configuration depends on your workload’s priorities: latency, throughput, model size, and cost efficiency.

GPU type

GPU type determines the performance profile of each replica, including memory capacity, throughput, and cost. In general:
Higher-end GPUs:
  • Largest models
  • High-throughput production workloads
  • Low-latency requirements
  • Long context windows
  • Heavy concurrent traffic

Mid-range GPUs:
  • Smaller or optimized models
  • Internal tools
  • Moderate traffic
  • Cost-sensitive deployments
  • Testing before scaling
Rule of thumb: Choose the smallest GPU type that supports your model and region. If performance is insufficient, first scale GPU type or GPUs per replica before aggressively increasing replicas.
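One way to reason about "the smallest GPU type that supports your model" is a rough memory estimate: weights take roughly parameters × bytes-per-parameter, plus headroom for KV cache and activations. The sketch below is a back-of-the-envelope check; the 1.2× overhead factor is an assumption for illustration, not a Token Factory figure:

```python
def fits_on_gpu(params_billions: float, gpu_memory_gb: float,
                bytes_per_param: float = 2.0,  # fp16/bf16 weights
                overhead: float = 1.2) -> bool:
    """Rough check: do the model weights, plus ~20% headroom for
    KV cache and activations, fit in a single GPU's memory?"""
    needed_gb = params_billions * bytes_per_param * overhead
    return needed_gb <= gpu_memory_gb

# An 8B model in bf16 needs roughly 8 * 2 * 1.2 = 19.2 GB:
print(fits_on_gpu(8, 80))   # 80 GB GPU -> True
print(fits_on_gpu(70, 80))  # 70B model needs ~168 GB -> False
```

When a model does not fit on one GPU, raising gpu_count (and sharding the model across GPUs in a replica) is the usual remedy before moving to a larger GPU type.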

GPUs per replica (gpu_count)

gpu_count defines how many GPUs power a single replica. This primarily affects:
  • Per-request latency
  • Maximum throughput per replica
  • Ability to serve larger or more demanding workloads
Lower GPU count per replica:
  • Lower baseline cost
  • Smaller workloads
  • Simpler traffic patterns
  • Experimental deployments

Higher GPU count per replica:
  • Higher throughput per replica
  • Lower latency for large workloads
  • Bigger model-serving requirements
  • Better vertical scaling before horizontal expansion
Rule of thumb: Start small when you are testing traffic, usage is uncertain, the deployment is internal, or cost control matters. Scale up when requests are latency-sensitive, individual replicas saturate, or you need stronger single-endpoint performance.

Avoid over-relying on replicas to compensate for underpowered replicas. More replicas help concurrency, but they do not fix poor per-request latency caused by insufficient GPU resources.
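The point about replicas not fixing latency can be made concrete with a toy model: total throughput scales with replicas, but each request is still served by a single replica, so per-request latency is bounded by per-replica performance. All numbers below are illustrative:

```python
def endpoint_profile(per_replica_rps: float, per_request_latency_s: float,
                     replicas: int) -> tuple[float, float]:
    """Toy model: throughput scales linearly with replicas; per-request
    latency, set by GPU type and gpu_count, does not change."""
    return per_replica_rps * replicas, per_request_latency_s

# Quadrupling replicas quadruples capacity...
assert endpoint_profile(5.0, 2.0, 4) == (20.0, 2.0)
# ...but a slow replica stays slow; only a stronger replica fixes latency.
assert endpoint_profile(5.0, 2.0, 1)[1] == endpoint_profile(5.0, 2.0, 8)[1]
```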

Choosing the number of replicas

Replicas determine how much baseline and burst traffic your deployment can handle.

Min replicas: guaranteed baseline capacity

Minimum replicas are always allocated while the endpoint is active. Use higher min_replicas when:
  • You need predictable low latency
  • Traffic is steady
  • Cold starts are unacceptable
  • Capacity predictability matters
Use lower min_replicas when:
  • Traffic is irregular
  • You are optimizing for cost
  • Workloads are batch or internal
  • Occasional warm-up is acceptable

Max replicas: burst scaling ceiling

Maximum replicas define how far autoscaling can expand if capacity is available. Use higher max_replicas when:
  • Traffic spikes significantly
  • Usage is unpredictable
  • You need burst headroom
Use lower max_replicas when:
  • Workload is stable
  • Budget control matters
  • Capacity is predictable
Rule of thumb:
  • min_replicas - what you always need
  • max_replicas - what you may need during peaks
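One hedged way to turn "what you always need" and "what you may need during peaks" into numbers is to size both bounds from measured traffic relative to measured per-replica throughput. All inputs below are assumptions you would replace with your own load-test figures:

```python
import math

def size_replicas(baseline_rps: float, peak_rps: float,
                  per_replica_rps: float) -> tuple[int, int]:
    """min_replicas covers steady baseline traffic; max_replicas covers
    observed peaks. per_replica_rps comes from your own load testing."""
    min_replicas = max(1, math.ceil(baseline_rps / per_replica_rps))
    max_replicas = max(min_replicas, math.ceil(peak_rps / per_replica_rps))
    return min_replicas, max_replicas

# Baseline 8 req/s, peaks of 30 req/s, each replica handles 5 req/s:
assert size_replicas(8, 30, 5) == (2, 6)
```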
Note that:
  • Min replicas are guaranteed while the endpoint is active. If you stop the endpoint, the reservation is freed.
  • Max replica scaling depends on available capacity and may not always be continuously available after a scale-down. Read more in the Capacity section.