# GPU Profiling Metrics

This guide describes how to enable advanced GPU profiling metrics from **NVIDIA Data Center GPU Manager (DCGM)**. These metrics provide deep visibility into GPU performance, including SM utilization, memory bandwidth, tensor core activity, and compute pipeline behavior - extending beyond standard GPU utilization metrics. For more details on metric definitions, see [NVIDIA profiling metrics](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#metrics).

## Available Metrics

Once enabled, NVIDIA Run:ai exposes the following GPU profiling metrics:

* SM activity - SM active cycles and occupancy
* Memory bandwidth - DRAM active cycles
* Compute pipelines - FP16/FP32/FP64 and tensor core activity
* Graphics engine - GR engine utilization
* PCIe/NVLink - Data transfer rates

NVIDIA Run:ai automatically aggregates these metrics at multiple levels:

* Per GPU device
* Per pod
* Per workload
* Per node

## Configuring the DCGM Exporter

The **DCGM Exporter** is responsible for exposing GPU performance metrics to Prometheus. To configure it for advanced metrics:

1. Create the metrics configuration file and save it as `dcgm-metrics.csv`:

   ```csv
   # DCGM FIELD, Prometheus metric type, help message

   # Clocks
   DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
   DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

   # Temperature
   DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
   DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

   # Power
   DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
   DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

   # PCIE
   DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

   # Utilization
   DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
   DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
   DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
   DCGM_FI_DEV_DEC_UTIL,      gauge, Decoder utilization (in %).

   # Errors
   DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.

   # Memory
   DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
   DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

   # NVLink
   DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
   DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,    counter, The number of bytes of active NVLink rx or tx data including both header and payload.

   # vGPU
   DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

   # Remapped rows
   DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
   DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
   DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

   # Labels
   DCGM_FI_DRIVER_VERSION, label, Driver Version

   # DCP Profiling Metrics (Advanced)
   DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
   DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
   DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
   DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
   DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
   DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
   DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
   DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
   DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
   DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
   DCGM_FI_PROF_NVLINK_TX_BYTES,    gauge, The number of bytes of active NvLink tx (transmit) data including both header and payload.
   DCGM_FI_PROF_NVLINK_RX_BYTES,    gauge, The number of bytes of active NvLink rx (read) data including both header and payload.
   ```
2. Create the following Helm values file and save it as `extended-dcgm-metrics-values.yaml`:

   ```yaml
   dcgmExporter:
     config:
       name: metrics-config
     env:
       - name: DCGM_EXPORTER_COLLECTORS
         value: /etc/dcgm-exporter/dcgm-metrics.csv
   ```
3. Run the following to create the ConfigMap and upgrade the GPU operator:

   ```bash
   # Get GPU Operator version
   GPU_OPERATOR_VERSION=$(helm ls -A | grep gpu-operator | awk '{ print $10 }')

   # Create ConfigMap with metrics configuration
   kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv

   # Upgrade GPU Operator with new configuration
   helm upgrade -i gpu-operator nvidia/gpu-operator \
     -n gpu-operator \
     --version $GPU_OPERATOR_VERSION \
     --reuse-values \
     -f extended-dcgm-metrics-values.yaml
   ```
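4. (Optional) Confirm that the DCGM exporter pods restarted with the custom collectors file. This is a quick sanity check; the DaemonSet name and pod label below are the GPU Operator defaults and may differ in your deployment:

   ```bash
   # Wait for the DCGM exporter DaemonSet to finish rolling out after the upgrade
   # (default GPU Operator DaemonSet name - adjust if yours differs)
   kubectl rollout status daemonset/nvidia-dcgm-exporter -n gpu-operator

   # Verify DCGM_EXPORTER_COLLECTORS points at the custom metrics file
   kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter \
     -o jsonpath='{.items[0].spec.containers[0].env}'
   ```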

## Enabling NVIDIA Run:ai Metric Aggregation

Enable NVIDIA Run:ai to create enriched metrics from the DCGM profiling data. This configures Prometheus recording rules that aggregate raw DCGM metrics per pod, workload, and node. See [Advanced cluster configurations](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/advanced-setup/cluster-config#prometheus) for more details.

* **Using Helm** - Set the following value in your `values.yaml` file under `clusterConfig` and upgrade the chart:

  ```yaml
  clusterConfig:
    prometheus:
      spec: # PrometheusSpec
        config:
          advancedMetricsEnabled: true
  ```
* **Using runaiconfig at runtime** - Use the following `kubectl` patch command:

  ```bash
  kubectl patch runaiconfig runai -n runai \
    --type=merge \
    -p '{
      "spec": {
        "prometheus": {
          "config": {
            "advancedMetricsEnabled": true
          }
        }
      }
    }'
  ```
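
Regardless of which method you use, you can confirm the setting was applied by reading it back from the `runaiconfig` resource; the field path below mirrors the patch command above:

```bash
# Should print "true" once the configuration has been applied
kubectl get runaiconfig runai -n runai \
  -o jsonpath='{.spec.prometheus.config.advancedMetricsEnabled}'
```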

## Enabling GPU Profiling Metrics Settings

GPU profiling metrics are disabled by default. To enable them:

1. Go to **General settings** and navigate to **Analytics**
2. Enable **GPU profiling metrics**

Once enabled, the metrics become visible on the **Workloads** and **Nodes** pages.

## Verification

{% tabs %}
{% tab title="UI" %}
Workloads:

1. Navigate to **Workload manager** → **Workloads**
2. Select a row in the Workloads table, then click **SHOW DETAILS** at the upper-right side of the action bar. The details pane appears, presenting the **Metrics** tab.
3. In the **Type** dropdown, verify that the **GPU profiling** option is available

Nodes:

1. Navigate to **Resources** → **Nodes**
2. Select a row in the Nodes table, then click **SHOW DETAILS** at the upper-right side of the action bar. The details pane appears, presenting the **Metrics** tab.
3. In the **Type** dropdown, verify that the **GPU profiling** option is available
   {% endtab %}

{% tab title="CLI v2" %}

```bash
# Check DCGM exporter pods are running
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Check for advanced metrics in Prometheus (if accessible): look for raw
# metrics such as DCGM_FI_PROF_SM_ACTIVE and DCGM_FI_PROF_DRAM_ACTIVE, and
# for Run:ai enriched metrics such as runai_gpu_sm_active_per_pod_per_gpu
```
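
If your Prometheus instance is reachable, you can confirm the metrics are being scraped by querying the Prometheus HTTP API directly. The port-forward target below is only an example; substitute the Prometheus service name and namespace used in your cluster:

```bash
# Forward the Prometheus API to localhost (service name and namespace are examples -
# adjust to match your cluster)
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &

# Query a raw DCGM profiling metric
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_PROF_SM_ACTIVE'

# Query a Run:ai enriched metric
curl -s 'http://localhost:9090/api/v1/query?query=runai_gpu_sm_active_per_pod_per_gpu'
```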

{% endtab %}

{% tab title="API" %}

* Workloads - To view the GPU profiling metrics per pod, refer to the [Pods](https://run-ai-docs.nvidia.com/api/2.24/workloads/pods) API
* Nodes - To view the GPU profiling metrics for nodes, refer to the [Nodes](https://run-ai-docs.nvidia.com/api/2.24/organizations/nodes) API
  {% endtab %}
  {% endtabs %}
