GPU Profiling Metrics

This guide describes how to enable advanced GPU profiling metrics from NVIDIA Data Center GPU Manager (DCGM). These metrics provide deep visibility into GPU performance, including SM utilization, memory bandwidth, tensor core activity, and compute pipeline behavior, extending beyond the standard GPU utilization metrics. For more details on metric definitions, see NVIDIA profiling metrics.

Available Metrics

Once enabled, NVIDIA Run:ai exposes the following GPU profiling metrics:

  • SM activity - SM active cycles and occupancy

  • Memory bandwidth - DRAM active cycles

  • Compute pipelines - FP16/FP32/FP64 and tensor core activity

  • Graphics engine - GR engine utilization

  • PCIe/NVLink - Data transfer rates

NVIDIA Run:ai automatically aggregates these metrics at multiple levels:

  • Per GPU device

  • Per pod

  • Per workload

  • Per node
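Conceptually, each roll-up is equivalent to a PromQL aggregation over the raw per-GPU series, as in the per-pod sketch below. This is illustrative only: the actual recording rules NVIDIA Run:ai installs may differ, and the label names depend on your dcgm-exporter Kubernetes mapping configuration.

```promql
# Average SM activity across all GPUs attached to each pod (illustrative)
avg by (namespace, pod) (DCGM_FI_PROF_SM_ACTIVE)
```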

Configuring the DCGM Exporter

The DCGM Exporter is responsible for exposing GPU performance metrics to Prometheus. To configure it for advanced metrics:

  1. Create the metrics configuration file and save it as dcgm-metrics.csv:
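A representative counter list covering the profiling metrics described above. The DCGM field names are standard profiling fields; trim or extend the set as needed. Each line follows the dcgm-exporter format: DCGM field, Prometheus metric type, help text.

```csv
# Format: DCGM field, Prometheus metric type, help text
DCGM_FI_PROF_SM_ACTIVE,          gauge,   Ratio of cycles at least one warp was active on an SM
DCGM_FI_PROF_SM_OCCUPANCY,       gauge,   Ratio of resident warps to the maximum supported per SM
DCGM_FI_PROF_DRAM_ACTIVE,        gauge,   Ratio of cycles the device memory interface was active
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge,   Ratio of time the graphics engine was active
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge,   Ratio of cycles the tensor cores were active
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge,   Ratio of cycles the FP16 pipe was active
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge,   Ratio of cycles the FP32 pipe was active
DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge,   Ratio of cycles the FP64 pipe was active
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, Bytes transmitted over PCIe
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, Bytes received over PCIe
DCGM_FI_PROF_NVLINK_TX_BYTES,    counter, Bytes transmitted over NVLink
DCGM_FI_PROF_NVLINK_RX_BYTES,    counter, Bytes received over NVLink
```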

  2. Create the following Helm values file and save it as extended-dcgm-metrics-values.yaml:
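A minimal sketch of the values file, assuming the DCGM Exporter is deployed through the NVIDIA GPU Operator and that the ConfigMap created in the next step is named dcgm-metrics:

```yaml
# extended-dcgm-metrics-values.yaml
dcgmExporter:
  config:
    # Must match the name of the ConfigMap holding dcgm-metrics.csv (step 3)
    name: dcgm-metrics
  env:
    - name: DCGM_EXPORTER_COLLECTORS
      # Path where the operator mounts the custom counters file
      value: /etc/dcgm-exporter/dcgm-metrics.csv
```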

  3. Run the following to create the ConfigMap and upgrade the GPU operator:
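For example, assuming the GPU Operator runs in the gpu-operator namespace under the Helm release name gpu-operator (adjust both to your environment):

```bash
# Create the ConfigMap referenced by dcgmExporter.config.name
kubectl create configmap dcgm-metrics \
    --namespace gpu-operator \
    --from-file=dcgm-metrics.csv

# Apply the new values to the GPU Operator release
helm upgrade gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --reuse-values \
    --values extended-dcgm-metrics-values.yaml
```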

Enabling NVIDIA Run:ai Metric Aggregation

Enable NVIDIA Run:ai to create enriched metrics from the DCGM profiling data. This configures Prometheus recording rules that aggregate raw DCGM metrics per pod, workload, and node. See Advanced cluster configurations for more details.

  • Using Helm - Set the following value in your values.yaml file under clusterConfig and upgrade the chart:
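A sketch of the shape this change takes; the flag name below is a placeholder, so substitute the exact key documented in Advanced cluster configurations:

```yaml
clusterConfig:
  # Placeholder key -- substitute the metric-aggregation flag from
  # the Advanced cluster configurations reference
  enableDcgmProfilingAggregation: true
```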

  • Using runaiconfig at runtime - Use the following kubectl patch command:
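A sketch of the patch, assuming the default resource name and namespace (runai/runai); the field path is again a placeholder for the key documented in Advanced cluster configurations:

```bash
# Placeholder field path -- substitute the documented key
kubectl patch runaiconfig runai -n runai --type merge \
    -p '{"spec": {"enableDcgmProfilingAggregation": true}}'
```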

Enabling GPU Profiling Metrics Settings

GPU profiling metrics are disabled by default. To enable them:

  1. Go to General settings and navigate to Analytics

  2. Enable GPU profiling metrics

Once enabled, the metrics become visible under the Workloads and Nodes pages.

Verification

Workloads:

  1. Navigate to Workload manager → Workloads

  2. Click a row in the Workloads table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the Metrics tab.

  3. In the Type dropdown, verify the GPU profiling option is available

Nodes:

  1. Navigate to Resources → Nodes

  2. Click a row in the Nodes table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the Metrics tab.

  3. In the Type dropdown, verify the GPU profiling option is available
