GPU Profiling Metrics
This guide describes how to enable advanced GPU profiling metrics from NVIDIA Data Center GPU Manager (DCGM). These metrics provide deep visibility into GPU performance, including SM utilization, memory bandwidth, tensor core activity, and compute pipeline behavior, extending well beyond the standard GPU utilization metric. For more details on metric definitions, see NVIDIA profiling metrics.
Available Metrics
Once enabled, NVIDIA Run:ai exposes the following GPU profiling metrics:
SM activity - SM active cycles and occupancy
Memory bandwidth - DRAM active cycles
Compute pipelines - FP16/FP32/FP64 and tensor core activity
Graphics engine - GR engine utilization
PCIe/NVLink - Data transfer rates
NVIDIA Run:ai automatically aggregates these metrics at multiple levels (see the example query after this list):
Per GPU device
Per pod
Per workload
Per node
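As a concrete illustration, per-pod rollups of the raw DCGM series can be computed with ordinary PromQL. The sketch below is a minimal example, not the exact recording rules NVIDIA Run:ai installs; it assumes the DCGM exporter attaches a `pod` label to its series (it does when its Kubernetes mapping is enabled), and the Prometheus service name and namespace are placeholders for your environment:

```bash
# Assumption: Prometheus is exposed by the service below; adjust name/namespace.
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &

# Average SM activity per pod over the last 5 minutes.
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg by (pod) (avg_over_time(DCGM_FI_PROF_SM_ACTIVE[5m]))'
```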
Configuring the DCGM Exporter
The DCGM Exporter is responsible for exposing GPU performance metrics to Prometheus. To configure it for advanced metrics:
1. Create the metrics configuration file and save it as dcgm-metrics.csv:

```csv
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).

# Errors
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.

# Memory
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# NVLink
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.

# vGPU
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed

# Labels
DCGM_FI_DRIVER_VERSION, label, Driver Version

# DCP Profiling Metrics (Advanced)
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
DCGM_FI_PROF_NVLINK_TX_BYTES, gauge, The number of bytes of active NvLink tx (transmit) data including both header and payload.
DCGM_FI_PROF_NVLINK_RX_BYTES, gauge, The number of bytes of active NvLink rx (read) data including both header and payload.
```

2. Create the following Helm values file and save it as extended-dcgm-metrics-values.yaml:

```yaml
dcgmExporter:
  config:
    name: metrics-config
  env:
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcgm-metrics.csv
```

3. Run the following to create the ConfigMap and upgrade the GPU Operator:

```bash
# Get GPU Operator version
GPU_OPERATOR_VERSION=$(helm ls -A | grep gpu-operator | awk '{ print $10 }')

# Create ConfigMap with metrics configuration
kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv

# Upgrade GPU Operator with new configuration
helm upgrade -i gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version $GPU_OPERATOR_VERSION \
  --reuse-values \
  -f extended-dcgm-metrics-values.yaml
```
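Before moving on, you may want to confirm the exporter is actually publishing the profiling series. A quick sanity check, assuming the exporter's default metrics port (9400) and the pod label typically applied by the GPU Operator (`app=nvidia-dcgm-exporter`); adjust both if your deployment differs:

```bash
# Pick one dcgm-exporter pod and read its metrics endpoint directly.
POD=$(kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n gpu-operator port-forward "$POD" 9400:9400 &

# Profiling series such as DCGM_FI_PROF_SM_ACTIVE should now appear.
curl -s http://localhost:9400/metrics | grep DCGM_FI_PROF
```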
Enabling NVIDIA Run:ai Metric Aggregation
Enable NVIDIA Run:ai to create enriched metrics from the DCGM profiling data. This configures Prometheus recording rules that aggregate raw DCGM metrics per pod, workload, and node. See Advanced cluster configurations for more details.
Using Helm - Set the following value in your values.yaml file under clusterConfig and upgrade the chart:

```yaml
clusterConfig:
  prometheus:
    spec: # PrometheusSpec
    config:
      advancedMetricsEnabled: true
```

Using runaiconfig at runtime - Use the following kubectl patch command:

```bash
kubectl patch runaiconfig runai -n runai \
  --type=merge \
  -p '{
    "spec": {
      "prometheus": {
        "config": {
          "advancedMetricsEnabled": true
        }
      }
    }
  }'
```
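Either way, the flag ends up on the runaiconfig resource, so you can read it back to confirm the change took effect; a minimal check:

```bash
# Prints "true" once the setting has been applied.
kubectl get runaiconfig runai -n runai \
  -o jsonpath='{.spec.prometheus.config.advancedMetricsEnabled}'
```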
Enabling the GPU Profiling Metrics Setting
GPU profiling metrics are disabled by default. To enable them:
Go to General settings and navigate to Analytics
Enable GPU profiling metrics
Once active, the metrics become visible on the Workloads and Nodes pages
Verification
Workloads:
Navigate to Workload manager → Workloads
Click a row in the Workloads table, then click the SHOW DETAILS button at the upper-right of the action bar. The details pane appears, showing the Metrics tab.
In the Type dropdown, verify the GPU profiling option is available
Nodes:
Navigate to Resources → Nodes
Click a row in the Nodes table, then click the SHOW DETAILS button at the upper-right of the action bar. The details pane appears, showing the Metrics tab.
In the Type dropdown, verify the GPU profiling option is available