NIM Observability Metrics via API
The NIM observability metrics API provides programmatic access to key runtime and performance metrics for workloads deployed as NIM. These metrics extend beyond traditional infrastructure monitoring by offering workload-specific insights such as request throughput, latency, and token usage (for LLMs). To support monitoring and optimization of these workloads, NVIDIA Run:ai exposes NIM observability metrics directly through the Workloads and Pods APIs.
NVIDIA Run:ai collects NIM observability metrics from workloads to generate charts and insights at the control-plane level; it does not expose the raw NIM metrics as-is. For more details, see Observability for NVIDIA NIM for LLMs.
NIM Metrics Availability
To enable observability, NVIDIA NIM workloads must be deployed in one of the supported forms:
NVIDIA Run:ai - Submit workloads through the NVIDIA Run:ai platform.
NIM Operator - Recommended for enterprise deployments with full lifecycle management.
Helm Chart - For Kubernetes-native deployment and integration with existing cluster tooling.
Container Image - A direct deployment option where NIM runs as a container image. This approach requires two conditions to ensure workloads are correctly identified and monitored (see the manifest sketch after this list):
Apply the label run.ai/nim-workload: "true"
Define a service port with the name serving-port
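For illustration, here is a minimal sketch of a Deployment and Service that satisfy both conditions. Kubernetes accepts JSON manifests as well as YAML; the names, image, and port number below are placeholders (NIM for LLMs commonly serves on port 8000), so adjust them to your deployment:
{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {
      "apiVersion": "apps/v1",
      "kind": "Deployment",
      "metadata": {
        "name": "nim-llm",
        "labels": { "run.ai/nim-workload": "true" }
      },
      "spec": {
        "replicas": 1,
        "selector": { "matchLabels": { "app": "nim-llm" } },
        "template": {
          "metadata": {
            "labels": {
              "app": "nim-llm",
              "run.ai/nim-workload": "true"
            }
          },
          "spec": {
            "containers": [
              {
                "name": "nim",
                "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",
                "ports": [ { "containerPort": 8000 } ]
              }
            ]
          }
        }
      }
    },
    {
      "apiVersion": "v1",
      "kind": "Service",
      "metadata": { "name": "nim-llm" },
      "spec": {
        "selector": { "app": "nim-llm" },
        "ports": [
          { "name": "serving-port", "port": 8000, "targetPort": 8000 }
        ]
      }
    }
  ]
}
The run.ai/nim-workload label marks the pods as a NIM workload, and the Service port named serving-port is what allows the metrics to be collected for this deployment form.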
NIM-Specific Request Metrics
NIM_NUM_REQUESTS_RUNNING - Number of requests currently running on the GPU
NIM_NUM_REQUESTS_WAITING - Number of requests waiting to be processed
NIM_NUM_REQUEST_MAX - Maximum number of requests that the model can run concurrently
NIM_REQUEST_SUCCESS_TOTAL - Number of successful requests (requests that finish with reason "stop" or "length")
NIM_REQUEST_FAILURE_TOTAL - Number of failed requests (requests that finish with any other reason)
NIM_GPU_CACHE_USAGE_PERC - GPU KV-cache usage (1 means 100 percent usage)
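None of the request metrics above are histograms, so the API returns them as time-series measurements (see Response Shape by Metric Type below). A brief illustration with made-up values:
{
  "measurements": [
    {
      "type": "NIM_GPU_CACHE_USAGE_PERC",
      "values": [
        { "value": "0.25", "timestamp": "2025-08-20T10:51:26Z" },
        { "value": "0.40", "timestamp": "2025-08-20T10:52:26Z" }
      ]
    }
  ]
}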
NIM Latency Metrics
Histogram and percentile-based latency metrics provide detailed insights into request performance:
NIM_TIME_TO_FIRST_TOKEN_SECONDS - Histogram of time to first token in seconds
NIM_E2E_REQUEST_LATENCY_SECONDS - Histogram of end-to-end request latency in seconds
NIM_TIME_TO_FIRST_TOKEN_SECONDS_PERCENTILES - Percentiles (p50, p90, p99) of time to first token in seconds
NIM_E2E_REQUEST_LATENCY_SECONDS_PERCENTILES - Percentiles (p50, p90, p99) of end-to-end request latency in seconds
Response Shape by Metric Type
The API returns metrics as either time-series gauges (under measurements) or histogram/percentile metrics (under histogram).
Histogram-based NIM Metrics
If you request any of the following metricType values, the results appear in the histogram field:
NIM_TIME_TO_FIRST_TOKEN_SECONDS
NIM_E2E_REQUEST_LATENCY_SECONDS
NIM_TIME_TO_FIRST_TOKEN_SECONDS_PERCENTILES
NIM_E2E_REQUEST_LATENCY_SECONDS_PERCENTILES
Percentile Variants
For metrics ending with _PERCENTILES, each entry is a timestamped percentile map:
{
  "histogram": [
    {
      "type": "NIM_TIME_TO_FIRST_TOKEN_SECONDS_PERCENTILES",
      "values": [
        {
          "timestamp": "2025-08-20T10:51:26Z",
          "data": {
            "p50": 0.0533,
            "p90": 0.0533,
            "p99": 10.0
          }
        }
      ]
    }
  ]
}
Raw Histogram Buckets
For non-percentile histogram metrics, each entry is a timestamped bucket → count map, where each key is a bucket upper bound in seconds and "+Inf" covers all remaining observations:
{
  "histogram": [
    {
      "type": "NIM_TIME_TO_FIRST_TOKEN_SECONDS",
      "values": [
        {
          "timestamp": "2025-08-20T10:51:26Z",
          "data": {
            "0.001": 0,
            "0.005": 0,
            "+Inf": 6
          }
        }
      ]
    }
  ]
}
Non-Histogram Metrics (Gauges)
All other metric types are returned as time-series measurements, with one numeric value (serialized as a string) per timestamp:
{
  "measurements": [
    {
      "type": "GPU_UTILIZATION",
      "values": [
        { "value": "1", "timestamp": "2025-08-20T10:51:26Z" },
        { "value": "10", "timestamp": "2025-08-20T10:52:26Z" },
        { "value": "11", "timestamp": "2025-08-20T10:53:26Z" }
      ]
    }
  ]
}
Mixed Requests
When multiple metricType values are requested, the response may include both sections:
{
  "measurements": [ /* non-histogram series here */ ],
  "histogram": [ /* histogram/percentiles here */ ]
}
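For instance, a single call that requests both NIM_NUM_REQUESTS_RUNNING and NIM_TIME_TO_FIRST_TOKEN_SECONDS_PERCENTILES (for example, by passing multiple metricType values) could return a combined response like the following; the values are illustrative:
{
  "measurements": [
    {
      "type": "NIM_NUM_REQUESTS_RUNNING",
      "values": [
        { "value": "2", "timestamp": "2025-08-20T10:51:26Z" }
      ]
    }
  ],
  "histogram": [
    {
      "type": "NIM_TIME_TO_FIRST_TOKEN_SECONDS_PERCENTILES",
      "values": [
        {
          "timestamp": "2025-08-20T10:51:26Z",
          "data": { "p50": 0.05, "p90": 0.4, "p99": 1.2 }
        }
      ]
    }
  ]
}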