GPU Profiling Metrics
This guide describes how to enable advanced GPU profiling metrics from NVIDIA Data Center GPU Manager (DCGM). These metrics go beyond standard GPU utilization metrics and provide deeper visibility into GPU performance, including SM utilization, memory bandwidth, tensor core activity, and compute pipeline behavior. For more details on metric definitions, see NVIDIA profiling metrics.
Available Metrics
Once enabled, NVIDIA Run:ai exposes the following GPU profiling metrics:
SM activity - SM active cycles and occupancy
Memory bandwidth - DRAM active cycles
Compute pipelines - FP16/FP32/FP64 and tensor core activity
Graphics engine - GR engine utilization
PCIe/NVLink - Data transfer rates
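As a rough orientation, the categories above can be related to DCGM profiling field identifiers. The sketch below is illustrative only: the field names are DCGM's `DCGM_FI_PROF_*` identifiers, but the grouping and the assumption that these are the fields behind each category are ours, not confirmed by this guide. `PROFILING_FIELDS` and `category_of` are hypothetical names.

```python
from typing import Optional

# Assumed mapping from the categories listed above to DCGM profiling fields.
# The field identifiers exist in DCGM; the grouping itself is an assumption.
PROFILING_FIELDS = {
    "SM activity": ["DCGM_FI_PROF_SM_ACTIVE", "DCGM_FI_PROF_SM_OCCUPANCY"],
    "Memory bandwidth": ["DCGM_FI_PROF_DRAM_ACTIVE"],
    "Compute pipelines": [
        "DCGM_FI_PROF_PIPE_FP16_ACTIVE",
        "DCGM_FI_PROF_PIPE_FP32_ACTIVE",
        "DCGM_FI_PROF_PIPE_FP64_ACTIVE",
        "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
    ],
    "Graphics engine": ["DCGM_FI_PROF_GR_ENGINE_ACTIVE"],
    "PCIe/NVLink": [
        "DCGM_FI_PROF_PCIE_TX_BYTES",
        "DCGM_FI_PROF_PCIE_RX_BYTES",
        "DCGM_FI_PROF_NVLINK_TX_BYTES",
        "DCGM_FI_PROF_NVLINK_RX_BYTES",
    ],
}


def category_of(field_name: str) -> Optional[str]:
    """Return the category a DCGM field belongs to, or None if unknown."""
    for category, fields in PROFILING_FIELDS.items():
        if field_name in fields:
            return category
    return None
```

A lookup such as `category_of("DCGM_FI_PROF_DRAM_ACTIVE")` returns the category name, which can help when matching raw exporter output to the list above.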
NVIDIA Run:ai automatically aggregates these metrics at multiple levels:
Per GPU device
Per pod
Per workload
Per node
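To make the aggregation levels concrete, the following sketch groups per-GPU samples by pod, workload, or node and averages them. This is an assumption for illustration: the guide does not state which aggregation function is used, and the sample records, label names, and `aggregate` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-GPU samples of one profiling metric (values are ratios 0..1).
samples = [
    {"node": "node-a", "pod": "train-0", "workload": "train", "gpu": "0", "value": 0.80},
    {"node": "node-a", "pod": "train-0", "workload": "train", "gpu": "1", "value": 0.60},
    {"node": "node-b", "pod": "train-1", "workload": "train", "gpu": "0", "value": 0.40},
]


def aggregate(records, level):
    """Average the per-GPU values of all records that share the same `level` label.

    Averaging is an assumption of this sketch, not the documented behavior.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[level]].append(record["value"])
    return {key: mean(values) for key, values in groups.items()}


per_pod = aggregate(samples, "pod")
per_workload = aggregate(samples, "workload")
per_node = aggregate(samples, "node")
```

With these samples, the two GPUs of `train-0` collapse into one pod-level value, and all three GPUs collapse into one workload-level value for `train`.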
Configuring the DCGM Exporter
The DCGM Exporter is responsible for exposing GPU performance metrics to Prometheus. To configure it for advanced metrics:
Create the metrics configuration file and save it as dcgm-metrics.csv:
Create the following Helm values file and save it as extended-dcgm-metrics-values.yaml:
Run the following to create the ConfigMap and upgrade the GPU operator:
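As a minimal sketch of the first step, the metrics configuration file could be generated programmatically. This assumes the DCGM Exporter's three-column CSV layout (`DCGM field, Prometheus metric type, help message`) with `#` comment lines. The selection of fields, the use of `gauge` for every entry, and the paraphrased help texts are assumptions, not the listing from this guide. `render_dcgm_csv` and `write_dcgm_csv` are hypothetical helper names.

```python
from pathlib import Path

# Assumed selection of profiling fields; help texts are paraphrased and
# deliberately comma-free so each row keeps exactly three columns.
FIELDS = [
    ("DCGM_FI_PROF_GR_ENGINE_ACTIVE", "Ratio of time the graphics engine is active"),
    ("DCGM_FI_PROF_SM_ACTIVE", "Ratio of cycles an SM has at least one warp assigned"),
    ("DCGM_FI_PROF_SM_OCCUPANCY", "Ratio of warps resident on an SM"),
    ("DCGM_FI_PROF_PIPE_TENSOR_ACTIVE", "Ratio of cycles the tensor pipe is active"),
    ("DCGM_FI_PROF_DRAM_ACTIVE", "Ratio of cycles the memory interface is active"),
]


def render_dcgm_csv(fields, metric_type="gauge"):
    """Render rows in the assumed `field, type, help` layout."""
    lines = ["# Format: DCGM field, Prometheus metric type, help message"]
    for name, help_text in fields:
        lines.append(f"{name}, {metric_type}, {help_text}")
    return "\n".join(lines) + "\n"


def write_dcgm_csv(path="dcgm-metrics.csv", fields=FIELDS):
    """Write the rendered CSV to `path` and return the path."""
    target = Path(path)
    target.write_text(render_dcgm_csv(fields), encoding="utf-8")
    return target


csv_text = render_dcgm_csv(FIELDS)
```

Calling `write_dcgm_csv()` would produce a `dcgm-metrics.csv` file in the current directory, ready to be packaged into the ConfigMap in the final step.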
Enabling NVIDIA Run:ai Metric Aggregation
Enable NVIDIA Run:ai to create enriched metrics from the DCGM profiling data. This configures Prometheus recording rules that aggregate raw DCGM metrics per pod, workload, and node. See Advanced cluster configurations for more details.
Using Helm - Set the following value in your values.yaml file under clusterConfig and upgrade the chart:
Using runaiconfig at runtime - Use the following kubectl patch command:
Enabling GPU Profiling Metrics Settings
GPU profiling metrics are disabled by default. To enable:
Go to General settings and navigate to Analytics
Enable GPU profiling metrics
Metrics become visible under the Workloads and Nodes pages once active
Verification
Workloads:
Navigate to Workload manager → Workloads
Click a row in the Workloads table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the Metrics tab.
In the Type dropdown, verify the GPU profiling option is available
Nodes:
Navigate to Resources → Nodes
Click a row in the Nodes table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the Metrics tab.
In the Type dropdown, verify the GPU profiling option is available
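In addition to the UI checks, one could confirm that raw profiling series reach Prometheus by using its HTTP API (`/api/v1/query`). The sketch below only builds the query URL and interprets a response payload. The Prometheus address, the choice of `DCGM_FI_PROF_SM_ACTIVE` as the probe metric, and the helper names are assumptions, and the sample payload is hypothetical data shaped like an instant-query response.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def has_series(payload_text):
    """Return True if a query response reports success and at least one series."""
    payload = json.loads(payload_text)
    if payload.get("status") != "success":
        return False
    return len(payload.get("data", {}).get("result", [])) > 0


# Hypothetical Prometheus address; replace with the address used in your cluster.
url = build_query_url("http://prometheus.example:9090", "DCGM_FI_PROF_SM_ACTIVE")

# Hypothetical response body, shaped like an instant-vector result.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "DCGM_FI_PROF_SM_ACTIVE", "gpu": "0"},
             "value": [1700000000, "0.42"]}
        ],
    },
})
```

Fetching `url` with any HTTP client and passing the body to `has_series` gives a quick yes/no answer; an empty result usually means the exporter is not yet publishing the profiling fields.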