# Best Practices

## General strategy for setting up a deployment

* **GPU type + GPU count** = performance per replica
* **Replicas** = traffic capacity and availability

Start with the smallest setup that reliably meets baseline needs, then scale deliberately based on real traffic and observed metrics:

<Steps>
  <Step title="Single-replica setup: choose a GPU type and gpu_count that meet your single-replica performance goals" />

  <Step title="Autoscaling setup: use replicas to absorb traffic spikes and concurrency" />
</Steps>

***

## Choosing a performance setup: GPU type and GPUs per replica

Choosing the right dedicated endpoint configuration depends on your workload’s priorities: latency, throughput, model size, and cost efficiency.

### GPU type

GPU type determines the performance profile of each replica, including memory capacity, throughput, and cost.

In general:

| Higher-end GPUs                                                                                                                                                              | Mid-range GPUs                                                                                                                                                  |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul><li>Largest models</li><li>High throughput production workloads</li><li>Low latency requirements</li><li>Long context windows</li><li>Heavy concurrent traffic</li></ul> | <ul><li>Smaller or optimized models</li><li>Internal tools</li><li>Moderate traffic</li><li>Cost-sensitive deployments</li><li>Testing before scaling</li></ul> |

<Tip>
  **Rule of thumb:** Choose the smallest GPU type that supports your model and is available in your region. If performance is insufficient, scale up the GPU type or the number of GPUs per replica before adding many more replicas.
</Tip>
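
One part of "supports your model" is whether the weights fit in GPU memory at all. The sketch below is a rough back-of-the-envelope estimate, not a platform API: the function name, the `usable_fraction` default, and the example GPU memory sizes are all assumptions for illustration. It only accounts for weights (parameters times bytes per parameter) and reserves the rest of the memory for the KV cache and runtime overhead.

```python
import math


def min_gpus_for_model(
    params_billions: float,
    bytes_per_param: float,
    gpu_memory_gb: float,
    usable_fraction: float = 0.7,
) -> int:
    """Estimate how many GPUs are needed just to hold the model weights.

    usable_fraction is an ASSUMED share of GPU memory available for weights
    after reserving room for the KV cache, activations, and runtime overhead.
    Tune it from your own measurements.
    """
    if min(params_billions, bytes_per_param, gpu_memory_gb) <= 0:
        raise ValueError("sizes must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    # 1e9 parameters * bytes per parameter is roughly that many gigabytes.
    weights_gb = params_billions * bytes_per_param
    return math.ceil(weights_gb / (gpu_memory_gb * usable_fraction))


# Hypothetical example: a 70B-parameter model in 16-bit precision (2 bytes
# per parameter) on GPUs with 80 GB of memory each.
print(min_gpus_for_model(70, 2, 80))
```

Treat the result as a lower bound on GPUs per replica for a given GPU type. Long context windows and heavy concurrent traffic need additional memory beyond the weights, so validate with a real load test.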

### GPUs per replica (`gpu_count`)

`gpu_count` defines how many GPUs power a single replica.

This primarily affects:

* Per-request latency
* Maximum throughput per replica
* Ability to serve larger or more demanding workloads

| Lower GPU count per replica                                                                                                       | Higher GPU count per replica                                                                                                                                                                    |
| --------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <ul><li>Lower baseline cost</li><li>Smaller workloads</li><li>Simpler traffic patterns</li><li>Experimental deployments</li></ul> | <ul><li>Higher throughput per replica</li><li>Lower latency for large workloads</li><li>Bigger model serving requirements</li><li>Better vertical scaling before horizontal expansion</li></ul> |

<Tip>
  **Rule of thumb:**

  * Start small when you are testing traffic, usage is uncertain, the deployment is internal, or cost control matters.
  * Scale up when requests are latency-sensitive, individual replicas saturate, or you need stronger single-endpoint performance.

  Avoid relying on additional replicas to compensate for underpowered ones. More replicas help with concurrency, but they do not fix poor per-request latency caused by insufficient GPU resources.
</Tip>
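
The distinction between per-request latency and concurrency can be sketched as a simple decision helper. This is a simplified illustration under stated assumptions: the `LoadTestResult` fields and the returned labels are hypothetical names, and the measurements are expected to come from your own load tests rather than from any platform API.

```python
from dataclasses import dataclass


@dataclass
class LoadTestResult:
    # Hypothetical measurements gathered from your own load tests.
    idle_latency_ms: float    # latency of a single request with no other traffic
    loaded_latency_ms: float  # latency at the expected peak concurrency
    target_latency_ms: float  # latency goal for the workload


def scaling_recommendation(result: LoadTestResult) -> str:
    """Return 'scale_up', 'scale_out', or 'ok' for one load-test result."""
    if result.idle_latency_ms > result.target_latency_ms:
        # Even an unloaded replica misses the target, so adding replicas
        # cannot help: the replica itself needs more GPU resources.
        return "scale_up"
    if result.loaded_latency_ms > result.target_latency_ms:
        # One replica is fast enough on its own but degrades under
        # concurrency, which additional replicas can absorb.
        return "scale_out"
    return "ok"


print(scaling_recommendation(LoadTestResult(900, 1200, 500)))
print(scaling_recommendation(LoadTestResult(200, 800, 500)))
```

Here `scale_up` means a larger GPU type or a higher `gpu_count`, and `scale_out` means more replicas. Real workloads are rarely this clear-cut, so use the result as a starting point for investigation.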

***

## Choosing number of replicas

Replicas determine how much baseline and burst traffic your deployment can handle.

### Min replicas: guaranteed baseline capacity

Minimum replicas are always allocated while the endpoint is active.

Use higher `min_replicas` when:

* You need predictable low latency
* Traffic is steady
* Cold starts are unacceptable
* Capacity predictability matters

Use lower `min_replicas` when:

* Traffic is irregular
* You are optimizing for cost
* Workloads are batch or internal
* Occasional warm-up is acceptable

### Max replicas: burst scaling ceiling

Maximum replicas define how far autoscaling can expand if capacity is available.

Use higher `max_replicas` when:

* Traffic spikes significantly
* Usage is unpredictable
* You need burst headroom

Use lower `max_replicas` when:

* Workload is stable
* Budget control matters
* Capacity is predictable

<Tip>
  **Rule of thumb:**

  * `min_replicas`: what you always need
  * `max_replicas`: what you may need during peaks
</Tip>
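
As a rough sketch of this rule of thumb, the replica bounds can be derived from traffic estimates and a measured per-replica throughput. The function name, the `headroom` default, and the floor of one replica are assumptions for illustration, not platform behavior; `rps_per_replica` should come from load-testing a single replica with your chosen GPU type and `gpu_count`.

```python
import math


def suggest_replica_bounds(
    baseline_rps: float,
    peak_rps: float,
    rps_per_replica: float,
    headroom: float = 0.2,
) -> tuple[int, int]:
    """Return (min_replicas, max_replicas) for the given traffic estimates."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    if peak_rps < baseline_rps:
        raise ValueError("peak_rps must be >= baseline_rps")
    # Derate each replica so it is not planned to run at full utilization.
    effective_rps = rps_per_replica * (1 - headroom)
    # ASSUMPTION: keep at least one replica; adjust if your setup differs.
    min_replicas = max(1, math.ceil(baseline_rps / effective_rps))
    max_replicas = max(min_replicas, math.ceil(peak_rps / effective_rps))
    return min_replicas, max_replicas


# Hypothetical numbers: 20 req/s baseline, 90 req/s peak, and a replica
# that sustains about 10 req/s within the latency target.
print(suggest_replica_bounds(20, 90, 10))
```

The baseline estimate drives `min_replicas` and the peak estimate drives `max_replicas`. Revisit both once real traffic data is available.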

<Note>
  * Min replicas are guaranteed while the endpoint is active. If you stop the endpoint, the reservation is released.
  * Scaling up to max replicas depends on available capacity. After a scale-down, the same capacity may not be immediately available again. [Read more in the Capacity section](/ai-models-inference/capacity-and-scaling)
</Note>
