Metrics and telemetry

Metrics are numeric measurements emitted by the NVIDIA Run:ai cluster and recorded over time. Telemetry is a numeric measurement emitted by the NVIDIA Run:ai cluster and recorded in real time.
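The distinction can be pictured as two data shapes: a metric is a series of timestamped samples, while telemetry is a single current value. The following is a minimal sketch with hypothetical class names and illustrative values. It does not reproduce the actual API response schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    timestamp: str  # ISO 8601 UTC timestamp, e.g. "2024-01-01T00:00:00Z"
    value: float


@dataclass
class MetricSeries:
    """A metric: one named measurement sampled repeatedly over a time range."""
    name: str
    samples: List[Sample]


@dataclass
class TelemetryReading:
    """Telemetry: a single current value with no history attached."""
    name: str
    value: float


def latest(series: MetricSeries) -> float:
    """Return the most recent sample value of a metric series."""
    # ISO 8601 UTC timestamps in one format sort chronologically as strings.
    return max(series.samples, key=lambda s: s.timestamp).value


# Illustrative values only, not real cluster output.
gpu_util = MetricSeries(
    name="GPU_UTILIZATION_PER_GPU",
    samples=[
        Sample("2024-01-01T00:00:00Z", 40.0),
        Sample("2024-01-01T00:05:00Z", 65.0),
    ],
)
ready_gpus = TelemetryReading(name="READY_GPUS", value=8)
```

A metric series supports questions about trends over a window. A telemetry reading only answers what the value is now.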

Scopes

NVIDIA Run:ai provides a control-plane API that supports and aggregates analytics at the following levels.

Level
Description

Cluster

A cluster is a set of node pools and nodes. Cluster metrics are aggregated at the cluster level. In the NVIDIA Run:ai user interface, these metrics are available in the Overview dashboard.

Node

Data is aggregated at the node level.

Node pool

Data is aggregated at the node pool level.

Workload

Data is aggregated at the workload level. In some workloads, such as distributed workloads, these metrics aggregate data from all worker pods.

Pod

The basic unit of execution.

Project

The basic organizational unit. Projects are used to implement resource allocation policies and to separate different initiatives.

Department

Departments are a grouping of projects.
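The levels above nest: pods roll up into workloads, and projects roll up into departments. The sketch below illustrates that roll-up with made-up names and values. It assumes a plain sum as the aggregation function, which the text above does not specify and which may differ per metric.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (workload, pod, GPU memory usage in bytes); illustrative values only.
pod_samples: List[Tuple[str, str, int]] = [
    ("train-job", "train-job-worker-0", 4_000_000_000),
    ("train-job", "train-job-worker-1", 6_000_000_000),
    ("notebook", "notebook-0", 1_000_000_000),
]


def aggregate_by_workload(samples: List[Tuple[str, str, int]]) -> Dict[str, int]:
    """Sum per-pod values into one value per workload (assumed aggregation)."""
    totals: Dict[str, int] = defaultdict(int)
    for workload, _pod, value in samples:
        totals[workload] += value
    return dict(totals)


# Hypothetical project-to-department mapping and per-project allocated GPUs.
project_department = {"team-a": "research", "team-b": "research", "team-c": "ops"}
project_allocated_gpus = {"team-a": 2.0, "team-b": 1.5, "team-c": 4.0}


def aggregate_by_department(
    values: Dict[str, float], mapping: Dict[str, str]
) -> Dict[str, float]:
    """Sum per-project values into one value per department."""
    totals: Dict[str, float] = defaultdict(float)
    for project, value in values.items():
        totals[mapping[project]] += value
    return dict(totals)


workload_totals = aggregate_by_workload(pod_samples)
department_totals = aggregate_by_department(project_allocated_gpus, project_department)
```

In this example the two worker pods of the distributed workload are reported as a single workload-level value.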

Supported metrics

Metric name in API
Applicable API endpoint
Metric name in UI per grid
Applicable UI grid

ALLOCATED_GPU

  • GPU devices (allocated)

  • Allocated GPUs

AVG_WORKLOAD_WAIT_TIME

CPU_LIMIT_CORES

CPU limit

CPU_MEMORY_LIMIT_BYTES

CPU memory limit

CPU_MEMORY_REQUEST_BYTES

CPU memory request

CPU_MEMORY_USAGE_BYTES

CPU memory usage

CPU_MEMORY_UTILIZATION

CPU memory utilization

CPU_REQUEST_CORES

CPU request

CPU_USAGE_CORES

CPU usage

CPU_UTILIZATION

  • CPU compute utilization

  • CPU utilization

GPU_ALLOCATION

GPU devices (allocated)

GPU_MEMORY_REQUEST_BYTES

GPU memory request

GPU_MEMORY_USAGE_BYTES

GPU memory usage

GPU_MEMORY_USAGE_BYTES_PER_GPU

GPU memory usage per GPU

GPU_MEMORY_UTILIZATION

GPU memory utilization

GPU_MEMORY_UTILIZATION_PER_GPU

GPU memory utilization per GPU

GPU_UTILIZATION_PER_GPU

GPU utilization per GPU

TOTAL_GPU

  • GPU devices total

  • Total GPUs

TOTAL_GPU_NODES

GPU_UTILIZATION_DISTRIBUTION

GPU utilization distribution

UNALLOCATED_GPU

  • GPU devices (unallocated)

  • Unallocated GPUs

CPU_QUOTA_MILLICORES

CPU_MEMORY_QUOTA_MB

CPU_ALLOCATION_MILLICORES

CPU_MEMORY_ALLOCATION_MB

POD_COUNT

RUNNING_POD_COUNT
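Several metric names above carry their unit as a suffix (`_BYTES`, `_MB`, `_CORES`, `_MILLICORES`), so values from different metrics may need converting before they are compared. The helper below is a hypothetical sketch of that normalization. It assumes `_MB` means 10^6 bytes, which is not stated here and should be checked against the API reference.

```python
from typing import Optional, Tuple

# Assumption: 1 MB = 1,000,000 bytes. Change to 1024 * 1024 if the API uses MiB.
BYTES_PER_MB = 1_000_000


def normalize(metric_name: str, value: float) -> Tuple[float, Optional[str]]:
    """Return (value, unit) with CPU expressed in cores and memory in bytes.

    Metrics without a recognized unit suffix are returned unchanged with
    unit None.
    """
    name = metric_name
    # Per-GPU variants keep the unit before the _PER_GPU suffix.
    if name.endswith("_PER_GPU"):
        name = name[: -len("_PER_GPU")]

    if name.endswith("_MILLICORES"):
        return value / 1000.0, "cores"
    if name.endswith("_CORES"):
        return value, "cores"
    if name.endswith("_MB"):
        return value * BYTES_PER_MB, "bytes"
    if name.endswith("_BYTES"):
        return value, "bytes"
    return value, None
```

For example, `normalize("CPU_QUOTA_MILLICORES", 2500)` yields 2.5 cores, which can then be compared directly with a `CPU_REQUEST_CORES` value.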

Advanced metrics

NVIDIA provides extended metrics as shown here. To enable these metrics, please contact NVIDIA Run:ai customer support.

Metric name in API
Applicable API endpoint
Metric name in UI
Applicable UI table

GPU_FP16_ENGINE_ACTIVITY_PER_GPU

GPU FP16 engine activity

GPU_FP32_ENGINE_ACTIVITY_PER_GPU

GPU FP32 engine activity

GPU_FP64_ENGINE_ACTIVITY_PER_GPU

GPU FP64 engine activity

GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU

Graphics engine activity

GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU

GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU

GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU

GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU

GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU

GPU_SM_ACTIVITY_PER_GPU

GPU SM activity

GPU_SM_OCCUPANCY_PER_GPU

GPU SM occupancy

GPU_TENSOR_ACTIVITY_PER_GPU

GPU tensor activity
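Once enabled, these metric names would typically be passed as query parameters to a metrics endpoint. The sketch below only builds such a URL and sends nothing. The host, the path, the workload ID, and the parameter names (`metricType`, `start`, `end`, `numberOfSamples`) are assumptions for illustration and should be confirmed against the NVIDIA Run:ai API reference.

```python
from typing import List
from urllib.parse import urlencode


def build_metrics_url(
    base_url: str,
    scope_path: str,
    metric_types: List[str],
    start: str,
    end: str,
    number_of_samples: int = 20,
) -> str:
    """Compose a metrics query URL for one scope (hypothetical layout)."""
    # One metricType parameter per requested metric (assumed convention).
    query = [("metricType", m) for m in metric_types]
    query += [
        ("start", start),
        ("end", end),
        ("numberOfSamples", str(number_of_samples)),
    ]
    return f"{base_url.rstrip('/')}/{scope_path.strip('/')}/metrics?{urlencode(query)}"


# Hypothetical host and workload ID.
url = build_metrics_url(
    "https://runai.example.com",
    "api/v1/workloads/1234",
    ["GPU_SM_ACTIVITY_PER_GPU", "GPU_TENSOR_ACTIVITY_PER_GPU"],
    start="2024-01-01T00:00:00Z",
    end="2024-01-01T01:00:00Z",
)
```

Keeping URL construction separate from the HTTP call makes it easy to inspect the request before adding authentication and sending it.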

Supported telemetry

Metric
Applicable API endpoint
Metric name in UI
Applicable UI table

WORKLOADS_COUNT

ALLOCATED_GPUS

Allocated GPUs

READY_GPU_NODES

Ready / Total GPU nodes

READY_GPUS

Ready / Total GPU devices

TOTAL_GPU_NODES

Ready / Total GPU nodes

TOTAL_GPUS

Ready / Total GPU devices

IDLE_ALLOCATED_GPUS

Idle allocated GPU devices

FREE_GPUS

Free GPU devices

TOTAL_CPU_CORES

CPU (Cores)

USED_CPU_CORES

ALLOCATED_CPU_CORES

Allocated CPU cores

TOTAL_GPU_MEMORY_BYTES

GPU memory

USED_GPU_MEMORY_BYTES

Used GPU memory

TOTAL_CPU_MEMORY_BYTES

CPU memory

USED_CPU_MEMORY_BYTES

Used CPU memory

ALLOCATED_CPU_MEMORY_BYTES

Allocated CPU memory

GPU_ALLOCATION_NON_PREEMPTIBLE

CPU_ALLOCATION_NON_PREEMPTIBLE

MEMORY_ALLOCATION_NON_PREEMPTIBLE
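In the table above, some UI fields combine two telemetry values: `READY_GPUS` and `TOTAL_GPUS` both map to "Ready / Total GPU devices", and `READY_GPU_NODES` and `TOTAL_GPU_NODES` both map to "Ready / Total GPU nodes". The sketch below shows one way such a field could be assembled from raw readings. The dictionary layout and the values are illustrative assumptions.

```python
from typing import Dict


def ready_total_label(telemetry: Dict[str, int], ready_key: str, total_key: str) -> str:
    """Format two telemetry readings as a single 'ready / total' string."""
    return f"{telemetry[ready_key]} / {telemetry[total_key]}"


# Illustrative readings only, not real cluster output.
telemetry = {
    "READY_GPUS": 14,
    "TOTAL_GPUS": 16,
    "READY_GPU_NODES": 3,
    "TOTAL_GPU_NODES": 4,
}

gpu_devices = ready_total_label(telemetry, "READY_GPUS", "TOTAL_GPUS")
gpu_nodes = ready_total_label(telemetry, "READY_GPU_NODES", "TOTAL_GPU_NODES")
```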
