Inference Observability provides real-time and historical metrics for your inference workloads, including latency, throughput, scaling, and error rates. It helps you debug performance issues, benchmark models, and understand the capacity behavior of dedicated or public endpoints. You can open Observability from the navigation bar: Inference → Observability.

What you can do with Observability

Inference Observability is designed to help you answer practical engineering questions as you debug, optimize, and operate your workloads. Use it to:
  • Pinpoint when and why latency increased
  • Determine whether errors come from your requests or from infrastructure
  • Compare performance across endpoints
  • Understand token throughput and response times
  • Monitor scaling behavior and capacity
  • Track trends over time

Metrics available

Traffic

These metrics describe how much load your endpoint is handling and help you understand traffic spikes, prompt-size trends, and throughput limits.
Metric | Description
Requests per minute | Number of requests sent to the API
Input tokens per minute | Incoming tokens processed
Output tokens per minute | Tokens generated by models
Input tokens per request | Distribution of prompt sizes
Output tokens per request | Distribution of response sizes
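As a back-of-the-envelope check, the per-minute token metrics roughly equal the request rate multiplied by the average tokens per request. The sketch below uses made-up numbers purely to illustrate that relationship:

```python
# Rough sanity check relating the traffic metrics above; the dashboard
# computes them server-side, and all numbers here are made-up examples.
requests_per_minute = 120            # "Requests per minute"
avg_input_tokens_per_request = 800   # midpoint of "Input tokens per request"
avg_output_tokens_per_request = 250  # midpoint of "Output tokens per request"

input_tokens_per_minute = requests_per_minute * avg_input_tokens_per_request
output_tokens_per_minute = requests_per_minute * avg_output_tokens_per_request

print(f"~{input_tokens_per_minute:,} input tokens per minute")    # ~96,000
print(f"~{output_tokens_per_minute:,} output tokens per minute")  # ~30,000
```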

Latency

Latency metrics describe how long requests take at different stages. They help you identify model warm-up effects, queueing or scaling delays, and slow generation behavior. Latency charts display the p50, p90, and p99 percentiles.
Metric | Description
End-to-end latency | Time from request sent to full response received
TTFT (Time to First Token) | Time until the first token is generated
Output speed (TPS) | Tokens generated per second
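If you also record latencies on the client side (for example, from a load test), the minimal sketch below shows how the same percentile and TPS figures can be derived. All sample values are hypothetical; the dashboard computes these for you:

```python
import math
import statistics

# Hypothetical per-request measurements (seconds and token counts); in
# practice these would come from your own client-side logs or load-test tool.
e2e_latency_s = [1.8, 2.1, 2.4, 2.0, 5.6, 2.2, 1.9, 2.3, 7.9, 2.1]
ttft_s        = [0.30, 0.35, 0.40, 0.32, 2.10, 0.36, 0.31, 0.38, 3.40, 0.33]
output_tokens = [220, 260, 300, 240, 410, 270, 230, 280, 520, 250]

def percentile(values, q):
    """Nearest-rank percentile, the way latency charts are usually read."""
    ordered = sorted(values)
    idx = max(0, math.ceil(q / 100 * len(ordered)) - 1)
    return ordered[idx]

for name, series in [("end-to-end latency", e2e_latency_s), ("TTFT", ttft_s)]:
    print(name, {f"p{q}": percentile(series, q) for q in (50, 90, 99)})

# Output speed (TPS): tokens generated per second of generation time,
# i.e. the time spent after the first token arrived.
tps = [tok / (e2e - ttft) for tok, e2e, ttft in zip(output_tokens, e2e_latency_s, ttft_s)]
print("median TPS:", round(statistics.median(tps), 1))
```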

Autoscaling and capacity

These metrics help explain why latency increases under load and whether scaling is working as expected. Some capacity metrics may be hidden depending on your pricing model.
Metric | Description
Active replicas | Currently running replicas
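A simple way to read this together with the traffic metrics is to track load per replica: if requests per replica keeps climbing while latency rises, capacity rather than the model is the likely bottleneck. The sketch below uses made-up values:

```python
# Illustrative check of whether scaling keeps up with load: if load per
# replica keeps climbing while latency rises, capacity (not the model) is
# the likely bottleneck. All sample values below are made up.
samples = [
    # (minute, requests_per_minute, active_replicas)
    (0, 100, 2),
    (1, 180, 2),
    (2, 260, 3),
    (3, 420, 3),
    (4, 430, 5),
]

for minute, rpm, replicas in samples:
    print(f"t+{minute}m: {rpm / replicas:.0f} requests/min per replica")
```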

Errors and success rate

Metric | Description
Error rate by status code | Percentage of failed requests grouped by HTTP status
Typical error classes:
  • 4xx — invalid requests or limits
  • 429 — rate limiting or capacity limits
  • 5xx — internal errors
This helps you quickly see:
  • Whether failures come from traffic patterns
  • Whether rate limits are reached
  • Whether infrastructure issues occurred
Looking for incident history or experiencing downtime? You can also check the Token Factory section on the Nebius Cloud status page.
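As an illustration of the breakdown above, the sketch below groups hypothetical client-side status codes into the same error classes and computes their share of traffic; the dashboard performs the equivalent aggregation for you:

```python
from collections import Counter

# Hypothetical per-request HTTP status codes taken from client-side logs;
# the dashboard computes the equivalent breakdown server-side.
statuses = [200, 200, 200, 429, 200, 500, 200, 400, 200, 200, 429, 200]

def error_class(code: int) -> str:
    """Map a status code to the error classes listed above."""
    if code == 429:
        return "429 (rate limiting or capacity limits)"
    if 400 <= code < 500:
        return "4xx (invalid requests or limits)"
    if code >= 500:
        return "5xx (internal errors)"
    return "success"

counts = Counter(error_class(code) for code in statuses)
total = len(statuses)
for bucket, n in sorted(counts.items()):
    print(f"{bucket}: {n / total:.1%} of requests")
```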

Filters and dimensions

You can filter metrics to narrow down specific traffic patterns, which makes it possible to compare regions and projects and to benchmark endpoints. Filters apply to all charts simultaneously.
Filter | Purpose
Time range | Analyze recent or historical traffic
Model endpoint | Compare endpoints
Project | Analyze per-project usage
API key | Identify individual clients
Region | Compare cross-region performance
Error code | Debug failures
Prompt length | Compare short vs long prompts
Latency range | Focus on slow requests
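For example, the Prompt length filter lets you compare latency for short and long prompts. The sketch below does the equivalent split on hypothetical client-side data, just to show what the filter isolates; the token cutoff is arbitrary:

```python
import statistics

# Hypothetical (input_tokens, end_to_end_latency_s) pairs; the Prompt length
# filter performs the equivalent split directly on the dashboard.
requests_log = [(120, 1.4), (3500, 4.8), (90, 1.2), (4200, 5.6), (150, 1.5), (2800, 4.1)]

# The 1,000-token cutoff below is arbitrary and only used for illustration.
short = [latency for tokens, latency in requests_log if tokens <= 1000]
long_ = [latency for tokens, latency in requests_log if tokens > 1000]

print(f"short prompts, median end-to-end latency: {statistics.median(short):.1f} s")
print(f"long prompts,  median end-to-end latency: {statistics.median(long_):.1f} s")
```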

Exporting metrics and API access

The Monitoring service in Nebius Token Factory provides metrics in two forms:
  • Preconfigured dashboards in the web UI show key metrics for each resource.
  • API access to metrics is available via Prometheus and Grafana® integrations, so you can filter and visualize the metrics however you want.
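As a rough sketch of what API access can look like, the example below runs an instant query against a Prometheus-compatible HTTP API. The base URL, metric name, label, and authentication header are placeholders, not documented identifiers; take the real values from the Observability API access page for your project:

```python
import requests  # third-party HTTP client: pip install requests

# Placeholders -- substitute the real values from the Observability API
# access page. The metric and label names below are assumptions.
BASE_URL = "https://<your-observability-endpoint>"
API_KEY = "<your-api-key>"
QUERY = "<requests_per_minute_metric>{endpoint='<your-endpoint-id>'}"

# Instant query against the standard Prometheus HTTP API path.
resp = requests.get(
    f"{BASE_URL}/api/v1/query",
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()

# A successful instant query returns a vector of (labels, [timestamp, value]).
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```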

Observability API integrations

Go to Observability API access to learn more about how to integrate with Prometheus and Grafana.

How metrics are calculated

Metrics are aggregated in short windows and refreshed continuously. The aggregation window depends on the selected time range. Typical characteristics:
  • Near-real-time updates (usually within tens of seconds)
  • Percentile-based latency statistics
  • Rolling aggregation windows that depend on the selected time scale
Observability dashboards are intended for operational debugging and performance analysis, not billing reconciliation.
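As a toy illustration of rolling aggregation, the sketch below recomputes a windowed median as each new sample arrives. The window size and samples are made up and do not reflect the actual aggregation intervals:

```python
import statistics
from collections import deque

# Toy illustration of a rolling aggregation window: the aggregate is
# recomputed over the most recent N samples as new data arrives.
WINDOW = 5
window = deque(maxlen=WINDOW)

latency_samples_s = [1.9, 2.1, 2.0, 2.3, 6.5, 2.2, 2.1, 2.0, 2.4, 2.2]
for i, sample in enumerate(latency_samples_s):
    window.append(sample)
    p50 = statistics.median(window)
    print(f"t={i}: rolling median over last {len(window)} samples = {p50:.2f} s")
```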

Access control

Observability is scoped to a project and follows project permissions.
Role | Access
Organization Admin | View all projects and their observability
Organization Billing Manager | No access to observability
Project Admin | View project observability
Project Member | View project observability

FAQ

Data retention and deletion

Observability data is tied to the project lifecycle. When a project is deleted, its associated observability data is automatically removed.

Regions and data locality

Projects may contain endpoints in multiple regions. Metrics are collected in the region where inference runs but are ultimately stored in the eu-north region. Filters let you compare endpoint data across regions.