# GPU Profiling Metrics

This guide describes how to enable advanced GPU profiling metrics from **NVIDIA Data Center GPU Manager (DCGM)**. These metrics provide deep visibility into GPU performance, including SM utilization, memory bandwidth, tensor core activity, and compute pipeline behavior - extending beyond standard GPU utilization metrics. For more details on metric definitions, see [NVIDIA profiling metrics](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#metrics).

## Available Metrics

Once enabled, NVIDIA Run:ai exposes the following GPU profiling metrics:

* SM activity - SM active cycles and occupancy
* Memory bandwidth - DRAM active cycles
* Compute pipelines - FP16/FP32/FP64 and tensor core activity
* Graphics engine - GR engine utilization
* PCIe/NVLink - Data transfer rates

NVIDIA Run:ai automatically aggregates these metrics at multiple levels:

* Per GPU device
* Per pod
* Per workload
* Per node

## Configuring the DCGM Exporter

The **DCGM Exporter** is responsible for exposing GPU performance metrics to Prometheus. To configure it for advanced metrics:

1. Create the metrics configuration file and save it as `dcgm-metrics.csv`:

   ```csv
   # DCGM FIELD, Prometheus metric type, help message

   # Clocks
   DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
   DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

   # Temperature
   DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
   DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

   # Power
   DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
   DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

   # PCIE
   DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

   # Utilization
   DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
   DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
   DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
   DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

   # Errors
   DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.

   # Memory
   DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
   DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

   # NVLink
   DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
   DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,    counter, The number of bytes of active NVLink rx or tx data including both header and payload.

   # vGPU
   DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

   # Remapped rows
   DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
   DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
   DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

   # Labels
   DCGM_FI_DRIVER_VERSION, label, Driver Version

   # DCP Profiling Metrics (Advanced)
   DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
   DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
   DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
   DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
   DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
   DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
   DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
   DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
   DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
   DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
   DCGM_FI_PROF_NVLINK_TX_BYTES,    gauge, The number of bytes of active NvLink tx (transmit) data including both header and payload.
   DCGM_FI_PROF_NVLINK_RX_BYTES,    gauge, The number of bytes of active NvLink rx (read) data including both header and payload
   ```
2. Create the following Helm values file and save it as `extended-dcgm-metrics-values.yaml`:

   ```yaml
   dcgmExporter:
     config:
       name: metrics-config
     env:
       - name: DCGM_EXPORTER_COLLECTORS
         value: /etc/dcgm-exporter/dcgm-metrics.csv
   ```
3. Run the following to create the ConfigMap and upgrade the GPU operator:

   ```bash
   # Get GPU Operator version
   GPU_OPERATOR_VERSION=$(helm ls -A | grep gpu-operator | awk '{ print $10 }')

   # Create ConfigMap with metrics configuration
   kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv

   # Upgrade GPU Operator with new configuration
   helm upgrade -i gpu-operator nvidia/gpu-operator \
     -n gpu-operator \
     --version $GPU_OPERATOR_VERSION \
     --reuse-values \
     -f extended-dcgm-metrics-values.yaml
   ```

## Enabling NVIDIA Run:ai Metric Aggregation

Enable NVIDIA Run:ai to create enriched metrics from the DCGM profiling data. This configures Prometheus recording rules that aggregate raw DCGM metrics per pod, workload, and node. See [Advanced cluster configurations](/self-hosted/infrastructure-setup/advanced-setup/cluster-config.md#prometheus) for more details.

* **Using Helm** - Set the following value in your `values.yaml` file under `clusterConfig` and upgrade the chart:

  ```bash
  clusterConfig:  
    prometheus:
      spec: # PrometheusSpec
        config:
          advancedMetricsEnabled: true
  ```
* **Using runaiconfig at runtime** - Use the following `kubectl` patch command:

  ```bash
  kubectl patch runaiconfig runai -n runai \
    --type=merge \
    -p '{
      "spec": {
        "prometheus": {
          "config": {
            "advancedMetricsEnabled": true
          }
        }
      }
    }'
  ```

## Enabling GPU Profiling Metrics Settings

GPU profiling metrics are disabled by default. To enable:

1. Go to **General settings** and navigate to Analytics
2. Enable **GPU profiling metrics**
3. Metrics become visible under the **Workloads** and **Nodes** pages once active

## Verification

{% tabs %}
{% tab title="UI" %}
Workloads:

1. Navigate to **Workload manager** → Workloads
2. Click a row in the Workloads table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the **Metrics** tab.
3. In the **Type** dropdown, verify the **GPU profiling** option is available

Nodes:

1. Navigate to **Resources** → Nodes
2. Click a row in the Nodes table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the **Metrics** tab.
3. In the **Type** dropdown, verify the **GPU profiling** option is available
   {% endtab %}

{% tab title="CLI v2" %}

```bash
# Check DCGM exporter pods are running
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Check for advanced metrics in Prometheus (if accessible)

# Look for metrics like: DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_DRAM_ACTIVE

# And Run:ai enriched metrics like: runai_gpu_sm_active_per_pod_per_gpu
```

{% endtab %}

{% tab title="API" %}

* Workloads - To view the GPU profiling metrics per pod, refer to the [Pods](https://run-ai-docs.nvidia.com/api/2.25/workloads/pods) API
* Nodes - To view the GPU profiling metrics for nodes, refer to the [Nodes](https://run-ai-docs.nvidia.com/api/2.25/organizations/nodes) API
  {% endtab %}
  {% endtabs %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://run-ai-docs.nvidia.com/self-hosted/platform-management/monitor-performance/gpu-profiling-metrics.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
