# Inference Observability

Inference Observability provides real-time and historical metrics for your inference workloads, including latency, throughput, scaling, and error rates. It helps you debug performance issues, benchmark models, and understand the capacity behavior of dedicated or public endpoints.

You can access Observability from:

**Navigation Bar → Inference → Observability**

## What you can do with Observability

Inference Observability is designed to help you debug, optimize, and operate inference workloads by answering practical engineering questions:

* When and why did latency increase?
* Are errors coming from my requests or from the infrastructure?
* How does performance compare across endpoints?
* What are my token throughput and response times?
* Is scaling behaving as expected, and is there enough capacity?
* How do these metrics trend over time?

***

## Metrics available

### Traffic

These metrics describe how much load your endpoint is handling. Use them to understand traffic spikes, prompt size trends, and throughput limits.

| Metric                    | Description                        |
| :------------------------ | :--------------------------------- |
| Requests per minute       | Number of requests sent to the API |
| Input tokens per minute   | Incoming tokens processed          |
| Output tokens per minute  | Tokens generated by models         |
| Input tokens per request  | Distribution of prompt sizes       |
| Output tokens per request | Distribution of response sizes     |
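
To make these definitions concrete, the sketch below derives per-minute request and token counts from a list of request records. The record fields (`ts`, `input_tokens`, `output_tokens`) are hypothetical names chosen for illustration; they are not a schema that Token Factory exposes.

```python
from collections import defaultdict


def per_minute_traffic(records):
    """Group request records into one-minute buckets.

    Each record is a dict with hypothetical keys:
      ts            - Unix timestamp in seconds
      input_tokens  - prompt tokens for the request
      output_tokens - generated tokens for the request
    """
    buckets = defaultdict(
        lambda: {"requests": 0, "input_tokens": 0, "output_tokens": 0}
    )
    for record in records:
        # Start of the minute that the request falls into.
        minute = int(record["ts"]) // 60 * 60
        bucket = buckets[minute]
        bucket["requests"] += 1
        bucket["input_tokens"] += record["input_tokens"]
        bucket["output_tokens"] += record["output_tokens"]
    return dict(sorted(buckets.items()))


records = [
    {"ts": 0, "input_tokens": 100, "output_tokens": 20},
    {"ts": 30, "input_tokens": 300, "output_tokens": 40},
    {"ts": 75, "input_tokens": 50, "output_tokens": 10},
]
traffic = per_minute_traffic(records)
```

Here the first minute holds two requests with 400 input tokens in total, and the second minute holds one request.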

### Latency

Latency metrics describe how long requests take at different stages. They help you identify model warmup effects, queueing or scaling delays, and slow generation. Latency charts display the `p50`, `p90`, and `p99` percentiles.

| Metric                     | Description                                      |
| :------------------------- | :----------------------------------------------- |
| End-to-end latency         | Time from request sent to full response received |
| TTFT (Time to First Token) | Time until the first token is generated          |
| Output speed (TPS)         | Tokens generated per second                      |
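
To read percentile charts, it helps to see how a percentile is derived from raw samples. The sketch below uses the nearest-rank method on a small list of made-up latency values. The dashboards may use a different estimation method, so treat this as an illustration rather than the exact calculation.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that
    at least p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up end-to-end latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 127, 900]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
```

With these samples, `p50` and `p90` stay close together while `p99` picks up the single slow request. This is why a rising `p99` with a flat `p50` usually points to a small share of slow requests rather than a general slowdown.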

### Autoscaling and Capacity

These metrics help explain why latency increases under load and whether scaling is working as expected. Some capacity metrics may be hidden depending on your pricing model.

| Metric          | Description                |
| :-------------- | :------------------------- |
| Active replicas | Currently running replicas |

### Errors and success rate

| Metric                    | Description                                          |
| :------------------------ | :--------------------------------------------------- |
| Error rate by status code | Percentage of failed requests grouped by HTTP status |

Typical error classes:

* 4xx — invalid requests or limits
* 429 — rate limiting or capacity limits
* 5xx — internal errors

This helps you quickly see:

* Whether failures come from traffic patterns
* Whether rate limits are reached
* Whether infrastructure issues occurred
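
As an illustration of the grouping described above, the following sketch computes the error rate per status code from a list of HTTP status codes and maps each code to one of the error classes listed. The function names and the sample data are hypothetical.

```python
from collections import Counter


def error_rates(status_codes):
    """Return the percentage of all requests that failed, per HTTP status code."""
    total = len(status_codes)
    if total == 0:
        return {}
    failures = Counter(code for code in status_codes if code >= 400)
    return {code: 100 * count / total for code, count in sorted(failures.items())}


def classify(code):
    """Map a status code to the error classes used in this section."""
    if code == 429:
        return "rate or capacity limit"
    if 400 <= code < 500:
        return "invalid request or limit"
    if code >= 500:
        return "internal error"
    return "success"


# Made-up sample: 100 requests, 10 of which failed.
codes = [200] * 90 + [429] * 6 + [400] * 1 + [500] * 3
rates = error_rates(codes)
```

In this sample, most failures are `429` responses, which suggests that rate or capacity limits were reached rather than that the requests were malformed or the infrastructure failed.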

<Note>
  Looking for incident history or experiencing downtime? Check the Token Factory section of the [Nebius Cloud status page](https://status.nebius.com).
</Note>

***

## Filters and dimensions

You can filter metrics to narrow them down to specific traffic patterns, which makes it possible to compare regions and projects and to benchmark endpoints. Filters apply to all charts simultaneously.

| Filter         | Purpose                              |
| :------------- | :----------------------------------- |
| Time range     | Analyze recent or historical traffic |
| Model endpoint | Compare endpoints                    |
| Project        | Analyze per-project usage            |
| API key        | Identify individual clients          |
| Region         | Compare cross-region performance     |
| Error code     | Debug failures                       |
| Prompt length  | Compare short vs long prompts        |
| Latency range  | Focus on slow requests               |

***

## Exporting metrics and API access

The Monitoring service provides Nebius Token Factory metrics in two forms:

* Preconfigured [dashboards](https://tokenfactory.nebius.com/observability) in the web UI show key metrics for each resource.
* API access to metrics is available through the [Prometheus](https://docs.nebius.com/observability/metrics/prometheus) and [Grafana®](https://docs.nebius.com/observability/metrics/grafana) integrations, so you can filter and visualize the metrics the way you want.

<Card icon="arrow-right-to-bracket" horizontal href="https://docs.tokenfactory.nebius.com/ai-models-inference/observability-api-integrations" title="Observability API integrations">
  Go to Observability API integrations to learn how to integrate with Prometheus and Grafana.
</Card>
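
As a rough sketch of what programmatic access can look like: once metrics are available in a Prometheus-compatible server, range queries go through the standard Prometheus HTTP API endpoint `/api/v1/query_range`. The base URL and the metric name `hypothetical_requests_total` below are placeholders, not real Token Factory values. Take the actual endpoint, authentication, and metric names from the integration guide.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step_seconds):
    """Build a URL for the Prometheus HTTP API range query endpoint.

    start and end are Unix timestamps in seconds; step_seconds is the
    resolution of the returned series.
    """
    params = urlencode(
        {"query": promql, "start": start, "end": end, "step": step_seconds}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


url = build_range_query_url(
    "https://prometheus.example.invalid",          # placeholder server
    "sum(rate(hypothetical_requests_total[5m]))",  # placeholder metric name
    start=1700000000,
    end=1700003600,
    step_seconds=60,
)
```

The sketch only builds the URL. Sending the request, for example with `urllib.request.urlopen`, also requires whatever credentials your setup expects.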

***

## How metrics are calculated

Metrics are aggregated in short windows and refreshed continuously. The aggregation window depends on the time range you choose.

Typical characteristics:

* Near-real-time updates (usually within tens of seconds)
* Percentile-based latency statistics
* Rolling aggregation windows that depend on the chosen time scale
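
The effect of the aggregation window can be sketched as follows. The example uses fixed, non-overlapping windows for simplicity rather than rolling ones, and the window lengths of 60 and 300 seconds are assumptions for illustration, not the values the dashboards use.

```python
from collections import defaultdict


def aggregate_mean(samples, window_seconds):
    """Bucket (timestamp, value) samples into fixed windows and average each one."""
    windows = defaultdict(list)
    for ts, value in samples:
        window_start = ts // window_seconds * window_seconds
        windows[window_start].append(value)
    return {
        start: sum(vals) / len(vals) for start, vals in sorted(windows.items())
    }


# Made-up latency samples (seconds since start, milliseconds) with one spike.
samples = [(0, 100), (60, 100), (120, 900), (180, 100), (240, 100)]
fine = aggregate_mean(samples, 60)     # one point per minute
coarse = aggregate_mean(samples, 300)  # one point per five minutes
```

With the shorter window the spike at 120 seconds is fully visible, while the longer window averages it into a single, lower value. Keep this in mind when a short incident seems to disappear after you zoom out to a longer time range.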

<Note>
  Observability dashboards are intended for operational debugging and performance analysis, not billing reconciliation.
</Note>

***

## Access control

Observability is a project-nested entity and follows project permissions.

| Role                         | Access                                    |
| :--------------------------- | :---------------------------------------- |
| Organization Admin           | View all projects and their observability |
| Organization Billing Manager | No access to observability                |
| Project Admin                | View project observability                |
| Project Member               | View project observability                |

***

## FAQ

### Data retention and deletion

Observability data is tied to the project lifecycle. When a project is deleted, the associated observability data is automatically removed.

### Regions and data locality

Projects may contain endpoints in multiple regions. Metrics are collected in the region where inference runs but are ultimately stored in the `eu-north` region. Filters let you compare endpoint data across regions.
