Metrics Store Requirements
In a multi-tenant deployment, integrating a multi-tenant metrics store is required to support:
Tenant usage reporting
System and workload monitoring
NVIDIA Run:ai components rely on Prometheus-compatible metrics for dashboards, APIs, and backend decision-making. The metrics store must:
Support PromQL (Prometheus Query Language)
Scale to support multiple tenants concurrently
Retain time-series data reliably for reporting and analysis
Supported Backends
Grafana Labs (hosted Prometheus)
Recommended managed service. Offers multi-tenancy, scalability, and ease of integration with NVIDIA Run:ai.
Grafana Mimir (self-hosted)
Supported self-managed alternative. The host organization is responsible for installation, configuration, and maintenance.
NVIDIA Run:ai components will query metrics from the configured store to power dashboards, reports, and scheduling decisions. Ensure availability and performance SLAs are in place.
Connecting Grafana Labs (Hosted Prometheus)
To connect NVIDIA Run:ai to Grafana Labs (hosted Prometheus), you need a Grafana Cloud Access Policy token. This token authenticates API requests and enables secure access to your metrics data.
Create an access token by following Create access policies and tokens in the Grafana Cloud documentation.
Create a values file (e.g., grafanalabs-values.yaml) with your hosted Prometheus endpoint and access token. Add the file during control plane installation:
thanos:
  enabled: false
tenantsManager:
  config:
    grafanaLab:
      accessToken: <<GRAFANA LAB ACCESS TOKEN>>
This configuration tells the NVIDIA Run:ai platform where to send PromQL queries for tenant insights and metrics.
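The example above shows only the access token. If your installation also requires the hosted Prometheus query endpoint to be set explicitly (as mentioned in the step above), it would use the same metricsService key shown in the Mimir section later in this guide. This is a sketch, not a confirmed requirement; the endpoint placeholder below is illustrative, and you should confirm the exact keys against the control plane installation documentation:
metricsService:
  config:
    datasourceUrl: <HOSTED_PROMETHEUS_QUERY_URL>  # placeholder; use the query endpoint from your Grafana Cloud stack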
Grafana Mimir Integration
If you choose to use Grafana Mimir as your metrics store, follow the steps below to ensure compatibility, security, and observability.
Installation and Configuration
Follow the official Grafana Mimir Helm chart documentation for installation. NVIDIA Run:ai is optimized for compatibility with Mimir. You can review all configurable options in Mimir Configuration Parameters.
Prerequisites
Make sure you have the following before deploying Mimir:
TLS certificate (private key and public certificate) - Used to secure HTTPS access to the metrics store. This should be a dedicated certificate specifically for the Mimir deployment (see the example Secret sketch after this list).
FQDN for Mimir access (e.g., mimir.runai.hostorg.com) - This must resolve to the Mimir service endpoint. Use a dedicated domain reserved for the metrics store.
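In a Kubernetes deployment, the TLS certificate is typically supplied to the ingress or gateway in front of Mimir as a standard TLS Secret. The secret name and namespace below are assumptions for illustration; use whatever names your Mimir Helm values reference:
apiVersion: v1
kind: Secret
metadata:
  name: mimir-tls        # assumed name; must match the secret referenced by your ingress/gateway values
  namespace: monitoring  # assumed namespace for the Mimir deployment
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded public certificate>
  tls.key: <base64-encoded private key>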
Helm Values Template
NVIDIA Run:ai provides a tested values.yaml configuration for Helm-based Mimir installation. See NVIDIA Run:ai Mimir Helm Chart.
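The tested values file linked above is authoritative. As an illustration of the kind of settings it covers, a Mimir deployment for multi-tenant NVIDIA Run:ai usage typically enables multi-tenancy and sets a data retention period. The keys below follow the upstream mimir-distributed chart and Mimir configuration reference and are an assumption, not an excerpt from the tested file; verify them against the chart version you install:
mimir:
  structuredConfig:
    multitenancy_enabled: true                 # isolate tenants via the X-Scope-OrgID header
    limits:
      compactor_blocks_retention_period: 30d   # example retention; size to your reporting needs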
Connecting Mimir to the NVIDIA Run:ai Control Plane
To integrate Mimir with the NVIDIA Run:ai control plane, include the required Mimir-specific configuration values in your values.yaml file when installing or upgrading the control plane. See Install the control plane for more details:
metricsService:
  config:
    datasourceUrl: <METRIC_STORE_READ_URL> # example: http://mimir-query-frontend.monitoring.svc:8080/prometheus
tenantsManager:
  config:
    defaultMetricStore:
      read:
        auth:
          basic:
            password: ''
            username: ''
        url: <METRIC_STORE_READ_URL> # example: http://mimir-query-frontend.monitoring.svc:8080/prometheus
        useXscopeHeader: true
      write:
        auth:
          basic:
            password: ''
            username: ''
        url: <METRIC_STORE_WRITE_URL> # example: http://mimir-distributor.monitoring.svc:8080/api/v1/push
thanos:
  enabled: false
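The basic auth fields can remain empty when the control plane reaches Mimir through unauthenticated in-cluster service URLs, as in the examples above. If you instead expose Mimir through the TLS-protected FQDN from the prerequisites behind an authenticated gateway, the read section might look like the following sketch; the username, password placeholder, and URL path are assumptions, not values from this guide:
tenantsManager:
  config:
    defaultMetricStore:
      read:
        auth:
          basic:
            username: runai-metrics                        # hypothetical account configured on your gateway
            password: <READ_PASSWORD>                      # placeholder
        url: https://mimir.runai.hostorg.com/prometheus    # FQDN from the prerequisites; path assumed to mirror the in-cluster example
        useXscopeHeader: true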
Monitoring and Debugging Mimir
Grafana Labs offers a collection of dashboards and alerts for monitoring a self-hosted Mimir. NVIDIA Run:ai uses these dashboards to monitor its own Mimir instance. To deploy these dashboards and alerts, see About Grafana Mimir dashboards and alerts requirements.
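If you installed Mimir with the mimir-distributed Helm chart, its metaMonitoring section can scrape Mimir's own components so the dashboards and alerts above have data. The key below is an assumption based on that chart's values and may differ between chart versions; consult the chart documentation for the exact options:
metaMonitoring:
  serviceMonitor:
    enabled: true   # create ServiceMonitors so your Prometheus-compatible scraper collects Mimir's own metrics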