Metrics and telemetry
Metrics are numeric measurements recorded over time and emitted from the NVIDIA Run:ai cluster. Telemetry is a numeric measurement recorded in real time, at the moment it is emitted from the NVIDIA Run:ai cluster.
Scopes
NVIDIA Run:ai provides a control-plane API that supports and aggregates analytics at various levels, called scopes.
Cluster
A cluster is a set of node pools and nodes. With the Cluster scope, metrics are aggregated at the cluster level. In the NVIDIA Run:ai user interface, these metrics are available in the Overview dashboard.
Node
Data is aggregated at the node level.
Node pool
Data is aggregated at the node pool level.
Workload
Data is aggregated at the workload level. For some workloads, such as distributed workloads, these metrics aggregate data from all worker pods.
Pod
The basic unit of execution.
Project
The basic organizational unit. Projects are the tool for implementing resource-allocation policies and for segregating different initiatives.
Department
A department is a grouping of projects.
Supported metrics
CPU_MEMORY_UTILIZATION: CPU memory utilization
CPU_UTILIZATION: CPU compute utilization
GPU_UTILIZATION: GPU compute utilization
UNALLOCATED_GPU: GPU devices (unallocated)
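As an illustration, a metrics query against the control-plane API typically combines a scope object (such as a workload), one or more of the metric type names listed above, and a time range. The host name, endpoint path, and parameter names in this sketch are assumptions for illustration only; consult the NVIDIA Run:ai API reference for the actual contract.

```python
from urllib.parse import urlencode

def build_metrics_url(base, workload_id, metric_types, start, end):
    """Build a hypothetical workload-metrics query URL.

    metric_types: metric type names such as GPU_UTILIZATION
                  or CPU_MEMORY_UTILIZATION.
    start / end:  ISO-8601 timestamps bounding the time series.
    """
    # Repeat the metricType parameter once per requested metric,
    # then append the time-range bounds.
    query = urlencode(
        [("metricType", m) for m in metric_types]
        + [("start", start), ("end", end)]
    )
    return f"{base}/api/v1/workloads/{workload_id}/metrics?{query}"

url = build_metrics_url(
    "https://example.run.ai",        # control-plane host (placeholder)
    "workload-uuid-placeholder",     # workload identifier (placeholder)
    ["GPU_UTILIZATION", "CPU_MEMORY_UTILIZATION"],
    "2024-01-01T00:00:00Z",
    "2024-01-01T01:00:00Z",
)
```

The same pattern applies to the other scopes (cluster, node pool, project, and so on), substituting the appropriate scope object in the path.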
Advanced metrics
NVIDIA provides extended metrics, as shown here. To enable these metrics, contact NVIDIA Run:ai customer support.
Supported telemetry