Inference Observability provides real-time and historical metrics for your inference workloads, including latency, throughput, scaling, and error rates. It helps you debug performance issues, benchmark models, and understand the capacity behavior of dedicated or public endpoints. You can access Observability from: Navigation Bar → Inference → Observability.
Latency metrics describe how long requests take at different stages. They help identify model warmup effects, queueing or scaling delays, and slow generation behavior. Latency charts display percentiles: p50, p90, and p99.
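For intuition, these percentiles are the familiar quantiles of the observed request latencies: p50 is the median, p90 and p99 capture tail behavior. A minimal Python sketch (the latency values below are made-up sample data, not output from the service):

```python
# Illustration only: compute p50/p90/p99 over a set of request latencies.
# The latency values are made-up sample data, not real service output.
import numpy as np

latencies_ms = [120, 135, 150, 180, 210, 250, 400, 950, 1200, 3100]

p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
print(f"p50={p50:.0f} ms  p90={p90:.0f} ms  p99={p99:.0f} ms")
```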
These metrics help explain why latency increases under load and whether scaling is working as expected. Some capacity metrics may be hidden depending on your pricing model.
You can filter metrics to narrow them down to specific traffic patterns. This makes it possible to compare regions and projects, and to benchmark endpoints. Filters apply to all charts simultaneously.
The Monitoring service provides Nebius Token Factory metrics in two forms:
Preconfigured dashboards in the web UI show key metrics for each resource.
API access to metrics is available via Prometheus and Grafana® integrations. With the API integration, you can filter and visualize the metrics the way you want.
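As a rough sketch of what API access can look like, the standard Prometheus HTTP API can be queried directly. The base URL, metric name, and label names below are placeholders chosen for illustration, not the identifiers the service actually exposes:

```python
# Sketch only: query a Prometheus-compatible endpoint via the standard
# /api/v1/query HTTP API. PROMETHEUS_URL, the metric name, and the label
# names are hypothetical placeholders, not the service's actual identifiers.
import requests

PROMETHEUS_URL = "https://prometheus.example.com"  # placeholder base URL
QUERY = 'inference_request_latency_seconds{region="eu-north", project="my-project"}'  # hypothetical metric and labels

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()

# Each series carries its label set and the latest sampled value.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    timestamp, value = series["value"]
    print(labels, value)
```

The same label-based filtering (for example by region or project) is what lets you slice and compare traffic the way the built-in filters do.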
Projects may contain endpoints in multiple regions. Metrics are collected in the region where inference runs, but are eventually stored in the eu-north region. Filters allow you to compare endpoint data across different regions.