Metrics and Telemetry
Metrics are numeric measurements recorded over time and emitted from the NVIDIA Run:ai cluster. Telemetry is a numeric measurement recorded in real time at the moment it is emitted from the NVIDIA Run:ai cluster.
Scopes
NVIDIA Run:ai provides a control-plane API that supports and aggregates analytics at the following levels (scopes).
Cluster
A cluster is a set of node pools and nodes. Cluster metrics are aggregated at the cluster level. In the NVIDIA Run:ai user interface, these metrics are available in the Overview dashboard.
Node
Data is aggregated at the node level.
Node pool
Data is aggregated at the node pool level.
Workload
Data is aggregated at the workload level. For some workloads, such as distributed workloads, these metrics aggregate data from all worker pods.
Pod
The basic unit of execution.
Project
The basic organizational unit. Projects are the tool for implementing resource allocation policies and for segregating different initiatives.
Department
Departments are a grouping of projects.
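Since each scope maps to its own aggregation level in the control-plane API, a small helper can illustrate how a metrics query might be composed per scope. This is a minimal sketch: the base URL, endpoint paths, and query-parameter names below are assumptions for illustration only; consult the NVIDIA Run:ai API reference for the actual paths and parameters.

```python
# Sketch: composing a metrics request URL for one resource in a given scope.
# All endpoint paths and parameter names here are hypothetical.
from urllib.parse import urlencode

# Hypothetical mapping from scope name to an API path segment.
SCOPE_PATHS = {
    "cluster": "clusters",
    "nodepool": "nodepools",
    "node": "nodes",
    "workload": "workloads",
    "pod": "pods",
}

def build_metrics_url(base, scope, resource_id, metric_types,
                      start, end, samples=20):
    """Compose a metrics query URL for one resource in the given scope."""
    if scope not in SCOPE_PATHS:
        raise ValueError(f"unknown scope: {scope}")
    params = urlencode({
        "metricType": ",".join(metric_types),
        "start": start,
        "end": end,
        "numberOfSamples": samples,
    })
    return f"{base}/api/v1/{SCOPE_PATHS[scope]}/{resource_id}/metrics?{params}"

url = build_metrics_url(
    "https://my-org.run.ai", "workload", "abc-123",
    ["GPU_UTILIZATION"],
    "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z",
)
```

The same helper would cover every scope above by switching the `scope` argument, since only the path segment and resource identifier change between aggregation levels.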
Supported Metrics
CPU_MEMORY_UTILIZATION: CPU memory utilization
CPU_UTILIZATION: CPU compute utilization (shown as "CPU utilization" in the user interface)
GPU_UTILIZATION: GPU compute utilization
UNALLOCATED_GPU: GPU devices (unallocated) (shown as "Unallocated GPUs" in the user interface)
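A metrics response typically returns each requested metric type with its sampled values. The response shape below is an assumption for illustration, not the documented NVIDIA Run:ai schema; it shows one way to flatten such a payload into per-metric time series.

```python
# Sketch: grouping measured values by metric type.
# The response structure and field names here are hypothetical.
sample_response = {
    "measurements": [
        {
            "type": "GPU_UTILIZATION",
            "values": [
                {"timestamp": "2024-01-01T00:00:00Z", "value": "41.5"},
                {"timestamp": "2024-01-01T00:05:00Z", "value": "73.0"},
            ],
        }
    ]
}

def series_by_type(response):
    """Group measured values as {metric_type: [(timestamp, float), ...]}."""
    out = {}
    for m in response.get("measurements", []):
        out[m["type"]] = [
            (v["timestamp"], float(v["value"])) for v in m.get("values", [])
        ]
    return out
```

Values are parsed to floats here because numeric samples are often serialized as strings in JSON APIs; drop the conversion if the real payload already returns numbers.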
GPU Profiling
NVIDIA provides extended GPU profiling metrics. To enable these metrics, contact NVIDIA Run:ai customer support.
NVIDIA NIM
NVIDIA NIM metrics provide workload-level observability, including key runtime and performance data such as request throughput, latency, and token usage for LLMs. See NIM observability metrics via API for more details.
NIM_TIME_TO_FIRST_TOKEN_SECONDS_PERCENTILES: Time to first token (TTFT) by percentiles
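A percentile metric like TTFT is commonly checked against a latency budget. The payload shape and field names below are assumptions for illustration; see the NIM observability metrics API for the actual response format.

```python
# Sketch: evaluating a TTFT percentile series against a latency budget.
# The list-of-percentiles shape here is hypothetical.
sample_ttft = [
    {"percentile": 50, "value": 0.12},  # seconds to first token, illustrative
    {"percentile": 90, "value": 0.35},
    {"percentile": 99, "value": 0.91},
]

def ttft_within_budget(percentiles, percentile=90, budget_s=0.5):
    """Return True if the chosen TTFT percentile is at or under the budget."""
    for entry in percentiles:
        if entry["percentile"] == percentile:
            return entry["value"] <= budget_s
    raise KeyError(f"percentile {percentile} not present")
```

Checking a high percentile (p90 or p99) rather than the mean is the usual choice for latency objectives, since tail latency is what users of an LLM endpoint actually experience.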
Supported Telemetry