

Common strategy for setting up a deployment

  • GPU type + GPU count = performance per replica
  • Replicas = traffic capacity and availability
Start with the smallest setup that reliably meets baseline needs, then scale deliberately based on real traffic and observability:
1. Single-replica setup: choose a GPU type and gpu_count that meet your single-replica performance goals
2. Autoscaling setup: use replicas to absorb traffic spikes and concurrency
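Putting the two steps together, a dedicated endpoint configuration might look like the sketch below. The field names and model ID here are illustrative assumptions, not the exact Token Factory API schema:

```python
# Hypothetical endpoint configuration -- field names are illustrative,
# not the exact Token Factory API schema.
endpoint_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # example model id
    # Step 1: per-replica performance (GPU type + count)
    "gpu_type": "h100",   # smallest type that meets latency goals
    "gpu_count": 1,       # GPUs powering a single replica
    # Step 2: autoscaling bounds for traffic capacity
    "min_replicas": 1,    # guaranteed baseline capacity
    "max_replicas": 4,    # burst ceiling for traffic spikes
}

# Sanity check: the autoscaling range must be well-formed.
assert endpoint_config["min_replicas"] <= endpoint_config["max_replicas"]
```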


Choosing a performance setup: GPU type & GPUs per replica

Choosing the right dedicated endpoint configuration depends on your workload’s priorities: latency, throughput, model size, and cost efficiency.

GPU type

GPU type determines the performance profile of each replica, including memory capacity, throughput, and cost. In general:
Higher-end GPUs:
  • Largest models
  • High-throughput production workloads
  • Low-latency requirements
  • Long context windows
  • Heavy concurrent traffic

Mid-range GPUs:
  • Smaller or optimized models
  • Internal tools
  • Moderate traffic
  • Cost-sensitive deployments
  • Testing before scaling
Rule of thumb: Choose the smallest GPU type that supports your model and region. If performance is insufficient, first scale GPU type or GPUs per replica before aggressively increasing replicas.
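One way to reason about "the smallest GPU type that supports your model" is a rough memory estimate: weights take roughly parameters × bytes-per-parameter, plus headroom for KV cache and activations. The sketch below is a back-of-the-envelope check; the 1.2× overhead factor is an assumption for illustration, not a Token Factory figure:

```python
def fits_on_gpu(params_billions: float, gpu_memory_gb: float,
                bytes_per_param: float = 2.0,  # fp16/bf16 weights
                overhead: float = 1.2) -> bool:
    """Rough check: do the model weights, plus ~20% headroom for
    KV cache and activations, fit in a single GPU's memory?"""
    needed_gb = params_billions * bytes_per_param * overhead
    return needed_gb <= gpu_memory_gb

# An 8B model in bf16 needs roughly 8 * 2 * 1.2 = 19.2 GB:
print(fits_on_gpu(8, 80))   # 80 GB GPU -> True
print(fits_on_gpu(70, 80))  # 70B model needs ~168 GB -> False
```

When a model does not fit on one GPU, raising gpu_count (and sharding the model across GPUs in a replica) is the usual remedy before moving to a larger GPU type.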

GPUs per replica (gpu_count)

gpu_count defines how many GPUs power a single replica. This primarily affects:
  • Per-request latency
  • Maximum throughput per replica
  • Ability to serve larger or more demanding workloads
Lower GPU count per replica:
  • Lower baseline cost
  • Smaller workloads
  • Simpler traffic patterns
  • Experimental deployments

Higher GPU count per replica:
  • Higher throughput per replica
  • Lower latency for large workloads
  • Bigger model-serving requirements
  • Better vertical scaling before horizontal expansion
Rule of thumb: Start small when you are testing traffic, usage is uncertain, the deployment is internal, or cost control matters. Scale up when requests are latency-sensitive, individual replicas saturate, or you need stronger single-endpoint performance.

Avoid over-relying on replicas to compensate for underpowered replicas. More replicas help concurrency, but they do not fix poor per-request latency caused by insufficient GPU resources.
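The point about replicas not fixing latency can be made concrete with a toy model: total throughput scales with replicas, but each request is still served by a single replica, so per-request latency is bounded by per-replica performance. All numbers below are illustrative:

```python
def endpoint_profile(per_replica_rps: float, per_request_latency_s: float,
                     replicas: int) -> tuple[float, float]:
    """Toy model: throughput scales linearly with replicas; per-request
    latency, set by GPU type and gpu_count, does not change."""
    return per_replica_rps * replicas, per_request_latency_s

# Quadrupling replicas quadruples capacity...
assert endpoint_profile(5.0, 2.0, 4) == (20.0, 2.0)
# ...but a slow replica stays slow; only a stronger replica fixes latency.
assert endpoint_profile(5.0, 2.0, 1)[1] == endpoint_profile(5.0, 2.0, 8)[1]
```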

Choosing the number of replicas

Replicas determine how much baseline and burst traffic your deployment can handle.

Min replicas: guaranteed baseline capacity

Minimum replicas are always allocated while the endpoint is active. Use higher min_replicas when:
  • You need predictable low latency
  • Traffic is steady
  • Cold starts are unacceptable
  • Capacity predictability matters
Use lower min_replicas when:
  • Traffic is irregular
  • You are optimizing for cost
  • Workloads are batch or internal
  • Occasional warm-up is acceptable

Max replicas: burst scaling ceiling

Maximum replicas define how far autoscaling can expand if capacity is available. Use higher max_replicas when:
  • Traffic spikes significantly
  • Usage is unpredictable
  • You need burst headroom
Use lower max_replicas when:
  • Workload is stable
  • Budget control matters
  • Capacity is predictable
Rule of thumb:
  • min_replicas - what you always need
  • max_replicas - what you may need during peaks
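One hedged way to turn "what you always need" and "what you may need during peaks" into numbers is to size both bounds from measured traffic relative to measured per-replica throughput. All inputs below are assumptions you would replace with your own load-test figures:

```python
import math

def size_replicas(baseline_rps: float, peak_rps: float,
                  per_replica_rps: float) -> tuple[int, int]:
    """min_replicas covers steady baseline traffic; max_replicas covers
    observed peaks. per_replica_rps comes from your own load testing."""
    min_replicas = max(1, math.ceil(baseline_rps / per_replica_rps))
    max_replicas = max(min_replicas, math.ceil(peak_rps / per_replica_rps))
    return min_replicas, max_replicas

# Baseline 8 req/s, peaks of 30 req/s, each replica handles 5 req/s:
assert size_replicas(8, 30, 5) == (2, 6)
```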
Note that:
  • Min replicas are guaranteed while the endpoint is active. If you stop the endpoint, the reservation is freed.
  • Max replica scaling depends on available capacity and may not always be continuously available after a scale-down. Read more in the Capacity section.