Install Using Base Command Manager (BCM)

NVIDIA Run:ai installation via NVIDIA Base Command Manager (BCM) is intended to simplify deployment, employing defaults meant to enable most NVIDIA Run:ai capabilities on NVIDIA DGX SuperPOD systems. See Installation for alternative installation approaches.

Pre-Installation Checklist

The following checklist outlines infrastructure, networking, and security requirements that must be collected and validated before beginning an NVIDIA Run:ai deployment. It’s provided for convenience to help ensure all prerequisites are met.

It’s recommended to save all necessary prerequisite files, along with any generated during the deployment, in a secure location that is accessible from the BCM head node. This will ease the installation process and help facilitate any subsequent redeployment, upgrade, or debugging needs. The /cm/shared/runai/ directory is used and will be assumed throughout this document.

1. IP Address - Reserved IP for the NVIDIA Run:ai control plane ingress. Example: 10.1.1.25

2. IP Address - Reserved IP for NVIDIA Run:ai inference. Example: 10.1.1.26

3. Name Record - Fully Qualified Domain Name (FQDN) pointing to the reserved internal IP used for the NVIDIA Run:ai control plane. Example: runai.mycorp.local → 10.1.1.25

4. Name Record - Wildcard FQDN pointing to the same control plane IP, used for subdomain-based access to NVIDIA Run:ai workspaces. Example: *.runai.mycorp.local → 10.1.1.25

5. Name Record - Wildcard FQDN pointing to a separate reserved internal IP for serving NVIDIA Run:ai inference workloads. Example: *.runai-inference.mycorp.local → 10.1.1.26

6. TLS Certificate - Full certificate chain required for secure access to the control plane FQDN. Example: /cm/shared/runai/full-chain.pem

7. TLS Private Key - Private key associated with the certificate. Important note: the private key must be kept secure. Example: /cm/shared/runai/private.key

8. Local CA Certificate (optional) - Full trust chain (signing CA public keys) for organizations that cannot use a publicly trusted certificate authority. Example: /cm/shared/runai/ca.crt

9. NVIDIA Run:ai Registry Credential - NVIDIA token required to access the NVIDIA Run:ai container registry. Used for downloading container images and artifacts from https://runai.jfrog.io/. Example: /cm/shared/runai/credential.jwt, a single-line file containing the base64-encoded token and no other text. For example:

SwLyIsXCJzDqzC9tE6vik6xxZkKY9OmvjiVu5pqNLij2-rdKA0SiEfK49fR7J8Z0CaeVuEMVYAr74gtWQoUtI8LHehuO4n1RJHPEvAQJBCojelPkTQ_6-tcWQ6gMI51BX9ZgY0EDmdDTVg7PMsIbNltJT78TAwZB-mdmesyNLtHshrMLy_HySIl2faajnK5mzuXKIB6Sd9cb5Xm-HTgFSgVd9lxCSgZRpGojyQ2c5pEnX8r96TYe2ndiq6kgnvA4zF1DruGwsgU_dF61Aj3l0hOkYYrYasgl6P7LaZp8gwk_0byl4ZYfO6OWuC3UjHidCsz7sTyxHQlDUnmv-dnzJsmqgZqS7L5SczEocw0NHjtx98ox99P6l6

Note: For illustration only. The above example is not a valid token.

10. Time Synchronization - All nodes involved must be synchronized using Network Time Protocol (NTP).

11. BCM Node Labels - k8s-system-user, and dgx-b200-k8s or dgx-gb200-k8s (depending on the DGX system type).
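
Time synchronization (item 10) can be spot-checked on each node; a minimal example, assuming systemd-based nodes where timedatectl is available:

# Confirm the node clock is synchronized via NTP
timedatectl | grep -i 'synchronized'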

Installation Preparation

Address Reservations and Name Records

Before installing NVIDIA Run:ai, make sure that at least the two IPs indicated in the pre-installation checklist table above are reserved, and the Fully Qualified Domain Names are properly associated and resolvable. Validating this before proceeding is recommended.

Reserve IP Addresses

Reserve at least two IP addresses from the internalnet network address block. All reserved IPs must be reachable within your internal network and cannot conflict with other allocations. These are critical for exposing the NVIDIA Run:ai control plane and inference services:

  • NVIDIA Run:ai control plane - Reserve one IP address for accessing core components such as the UI and API.

  • NVIDIA Run:ai workspaces - Subdomain-based access to workspaces uses the same IP address as the control plane.

  • Inference - Reserve a second IP address specifically for serving inference workloads.

Note

For NVIDIA DGX SuperPOD and BasePOD, the IP addresses must be selected from the internalnet range that the control plane nodes reside on. DGX worker nodes reside on separate dedicated networks (typically prefixed with dgxnet). Be sure to avoid assigning addresses from those ranges.

The BCM BaseView network section (Networks > Network Entities > Actions) shows which IP range is used for internalnet and which addresses are presently consumed. The output below also indicates which pool is reserved for dynamic (DHCP) allocation within that range. Do not select IPs that are already in use or that fall within the reserved dynamic range.

cmsh -c "network show internalnet; quit"

Parameter                        Value                                           
-------------------------------- ------------------------------------------------
Name                             internalnet                                                                       

Base address                     x.x.x.0
Dynamic range start              x.x.x.248                                    
Dynamic range end                x.x.x.254 
Netmask bits                     24  

Address Accessibility

As suggested by the name, internalnet is reserved for communication between control plane nodes and, as described in the SuperPOD documentation, makes use of standard blocks of nonroutable addresses (RFC 1918).

To ensure the reserved IP addresses are accessible, implementation planning must consider the intended use. Several options are available depending on requirements. For example, VPN or remote shell access with private name records (local to the device accessing the endpoints) to the environment may suffice. For delivering production inference serving to an external audience, alternative routable address blocks with firewall coverage limiting inbound access to the ingress IP and relevant ports only may be necessary. For detailed guidance, see the network requirements to plan implementation accordingly.

DNS Records

A Fully Qualified Domain Name (FQDN) is required to install the NVIDIA Run:ai control plane (e.g., runai.mycorp.local). This cannot be an IP address alone. The domain name must minimally be resolvable inside the organization's private network. The FQDN must point to the control plane’s reserved IP, either:

  • As a DNS (A record) pointing directly to the IP

  • Or, a CNAME alias to a host DNS record pointing to that same IP address

NVIDIA Run:ai workspace accessibility via subdomains is enabled by creating an additional wildcard DNS (A record) pointing to the same IP as the NVIDIA Run:ai control plane.

For inference workloads, additionally configure a wildcard DNS record that maps to the reserved inference ingress IP address. This ensures each inference workload is accessible at a unique subdomain.

For example:

IP Address    FQDN                             Purpose
-----------   ------------------------------   -------------------------------------------------------
10.1.1.25     runai.mycorp.local               Accessing the NVIDIA Run:ai control plane (UI and API)
10.1.1.25     *.runai.mycorp.local             NVIDIA Run:ai workspaces subdomains
10.1.1.26     *.runai-inference.mycorp.local   Serving endpoints for inference workloads

Note

Name resolution should be validated prior to installing. For example:

# Validate Name Resolution

dig A runai.mycorp.local

;; ANSWER SECTION:
runai.mycorp.local.	60	IN	A	10.1.1.25
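
The wildcard records can be checked the same way; any label under the wildcard domain should resolve to the corresponding ingress IP (the hostnames below are illustrative):

# Validate wildcard resolution for workspaces and inference subdomains
dig +short A test.runai.mycorp.local
dig +short A test.runai-inference.mycorp.local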

Certificates for Secure Communication

These certificates (commonly called TLS/SSL certificates, in the standard X.509 format) establish trust by proving a server’s identity and enabling encrypted communication between clients and services. They are required to secure communication between the NVIDIA Run:ai control plane, workload elements, and services, and are used to authenticate system components while encrypting data in transit.

There are four main categories of certificates that may be leveraged in the installation, refer to the TLS certificates section for more details.

NVIDIA Run:ai Certificate Requirements

You must have TLS X.509 certificates that are issued for the Fully Qualified Domain Names (FQDN) of the NVIDIA Run:ai control plane and Inference ingress. The certificate’s Common Name (CN) must match the FQDN and the Subject Alternative Name (SAN) must also include the FQDN. The following should be provided:

  • The full certificate chain (e.g., /cm/shared/runai/full-chain.pem)

  • The private key associated with the certificate (e.g., /cm/shared/runai/private.key). The private key must be kept secure. It proves the server’s identity and is required to decrypt TLS traffic. If compromised, an attacker could impersonate the service or read encrypted communications.

# Run the following from the /cm/shared/runai/ directory

# Certificate Verification 
openssl verify -CAfile ./rootCA.pem -verify_hostname runai.mycorp.local ./runai.crt

# Inspecting Certificate
openssl x509 -in ./runai.crt -text -noout | grep -A 5 "Subject Alternative Name"
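
It is also worth confirming that the private key matches the certificate before starting the wizard. A minimal check, assuming the file names used in this guide, compares the public key digests, which should be identical:

# Confirm the private key matches the leaf certificate
openssl x509 -in ./runai.crt -pubkey -noout | openssl sha256
openssl pkey -in ./private.key -pubout | openssl sha256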

NVIDIA Run:ai Registry Credentials

To access the NVIDIA Run:ai container registry to obtain installation artifacts, you will receive a token from NVIDIA. Take the token and paste it into your /cm/shared/runai/credential.jwt file to be used during the BCM installation assistant.

# For illustration only, not a valid Base64 token

less /cm/shared/runai/credential.jwt

wLyIsXCJzDqzC9tE6vik6xxZkKY9OmvjiVu5pqNLij2-rdKA0SiEfK49fR7J8Z0CaeVuEMVYAr74gtWQoUtI8L
HehuO4n1RJHPEvAQJBCojelPkTQ_6-tcWQ6gMI51BX9ZgY0EDmdDTVg7PMsIbNltJT78TAwZB-mdmesyNLtHsh
rMLy_HySIl2faajnK5mzuXKIB6Sd9cb5Xm-HTgFSgVd9lxCSgZRpGojyQ2c5pEnX8r96TYe2ndiq6kgnvA4zF1
DruGwsgU_dF61Aj3l0hOkYYrYasgl6P7LaZp8gwk_0byl4ZYfO6OWuC3UjHidCsz7sTyxHQlDUnmv-dnzJsmqg
ZqS7L5SczEocw0NHjtx98ox99P6l6
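
If the token arrives via copy/paste, one way to store it as a single line and restrict access is sketched below (the placeholder is not a real token):

# Write the token as a single line with no extra text and restrict permissions
printf '%s' '<PASTE_NVIDIA_TOKEN_HERE>' > /cm/shared/runai/credential.jwt
chmod 600 /cm/shared/runai/credential.jwt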

The installation assistant will request this token and it can be provided by either pasting it into the field or specifying a file path. Placing this token, the certificates mentioned above, configuration files, and any other deployment artifacts in the /cm/shared/runai/ directory is suggested. This directory resides on a shared mountpoint accessible from all nodes.

SuperPOD Installation Architecture

Hardware Requirements

This guide describes deploying NVIDIA Run:ai via the BCM installation assistant on DGX SuperPOD systems. All necessary hardware elements are included in the bill of materials for DGX GB200 and B200 systems and later. Refer to the system requirements for other platforms.

NVIDIA Base Command Manager (BCM)

BCM offers centralized tools for provisioning, monitoring, and managing DGX SuperPODs, BasePODs, and other GPU-accelerated clusters. This includes scaling and managing DGX node lifecycle, configuration of network topologies, storage configurations, and more. In the context of this guide, it’s the nexus for management of underpinning hardware, OS images, underlay network topology, storage configuration, and streamlining deployment of NVIDIA Run:ai on NVIDIA DGX SuperPOD.

BCM Head Node

The Head Node is the primary management host for Base Command Manager and may be installed on two separate hosts (primary and alternate) with a shared virtual IP address for high availability. The BCM cm-kubernetes-setup installation assistant will configure a proxy at the head nodes that assists in providing and securing common access to the Kubernetes clusters it provisions.

BCM Node Categories

In BCM, a node category is a way to group nodes that share the same configuration (e.g., based on hardware profile and intended purpose). Defining node categories, each with an associated software image, allows the system to assign the appropriate software image and configuration to each group.

Before installing NVIDIA Run:ai, prepare the following BCM node categories:

Category Name                 Purpose                                        Required/Optional
----------------------------  ---------------------------------------------  -----------------
k8s-system-user               Kubernetes control plane / CPU system nodes     Required
dgx-b200-k8s, dgx-gb200-k8s   DGX GPU worker nodes                            Required
runai-system                  Dedicated NVIDIA Run:ai system nodes            Optional
runai-cpu-worker              Dedicated NVIDIA Run:ai CPU worker nodes        Optional
k8s-etcd-user                 Dedicated etcd nodes                            Optional
Note

These categories are established in BCM and will be employed in the installation assistant for NVIDIA Run:ai. In the case of the optional categories, NVIDIA Run:ai administrators can optionally establish node roles mapping to these as needed, but the installation assistant does not automate mapping or synchronization of BCM node categories to NVIDIA Run:ai node roles.

The BCM 11 Administrator Manual provides background for creating and managing node categories, software images, provisioning nodes and more. For DGX SuperPOD, refer to the "Category Creation and Software Image Setup" sections of the Installation Guide for specific instructions on how to prepare Node Categories.

# DGX GB200 SuperPOD Example
# Nodes can either be ARM or x86 based - it's recommended to
# include the architecture in the Software Image name 

cmsh -c "category list; exit"

Name (key)                  Software image                   Nodes   
--------------------------- -------------------------------- --------    
dgx-gb200-k8s               dgx-image-gb200-k8s              12              
k8s-system-user             k8s-system-user-image            3
k8s-system-admin            k8s-system-admin-image           3

Installation Assistant

The BCM cm-kubernetes-setup installation assistant automates deployment through a Terminal User Interface (TUI) wizard. The wizard inquires about environment-specific details, deploys and configures the required subcomponents, and completes with a functioning self-hosted NVIDIA Run:ai capability. The deployment uses opinionated defaults for DGX SuperPOD systems, intended to deliver ease of use while making the broadest set of features available.

Kubernetes

NVIDIA Run:ai is built on Kubernetes, complementary cloud-native components (e.g. Prometheus, Knative), and NVIDIA Kubernetes enabling software (e.g. GPU Operator, Network Operator). These will be installed and configured by the cm-kubernetes-setup installation assistant.

On DGX GB200 SuperPOD and later systems, a separate Kubernetes cluster will also be present and must be in place before NVIDIA Run:ai is installed. This cluster delivers the capability of NVIDIA Mission Control (NMC) and is not intended for deployment of other workloads or non-administrative access. It’s important to be aware of this cluster in relation to the dedicated NVIDIA Run:ai cluster that will be created as described in this guide.

NVIDIA Run:ai Kubernetes Cluster

The system components section indicates that NVIDIA Run:ai is primarily made up of two components installed over Kubernetes (namely the NVIDIA Run:ai cluster and the NVIDIA Run:ai control plane). The installation assistant described in this guide combines both within a single Kubernetes cluster. Additionally, the steps described here use the same CPU nodes to co-locate the Kubernetes control plane, etcd, and the NVIDIA Run:ai control plane.

Load Balancing

The BCM NVIDIA Run:ai deployment assistant makes use of a load balancer (MetalLB) within the Kubernetes cluster to provide overall service resiliency and consistent access to NVIDIA Run:ai endpoints.
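
After the deployment completes, the MetalLB address pools and the load-balanced services they back can be inspected. A hedged example follows; the namespace and pool names depend on how the assistant configures MetalLB:

# List MetalLB address pools and any services holding external IPs
kubectl get ipaddresspools.metallb.io -A
kubectl get svc -A | grep LoadBalancer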

Distinguishing Between Clusters

The following naming conventions are used to distinguish between the two clusters:

Kubernetes Cluster   Purpose
-------------------  -----------------------
k8s-admin            NVIDIA Mission Control
k8s-user             NVIDIA Run:ai

The following commands (executed from the BCM head node) are recommended for switching between clusters via the command line:

# list available Kubernetes modules
module avail
-------------- /cm/local/modulefiles -----------------------

kubernetes/k8s-admin/1.32.9-1.1
kubernetes/k8s-user/1.32.9-1.1

# display which modules are presently loaded
module list


# set access to a cluster 
module load kubernetes/k8s-user/1.32.9-1.1

# switch between Kubernetes clusters
module swap kubernetes/k8s-admin kubernetes/k8s-user
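
After loading or swapping modules, a quick check confirms which cluster kubectl is pointed at:

# Confirm the active kubeconfig context and cluster membership
kubectl config current-context
kubectl get nodes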

Shared Storage

Workload Assets

NVIDIA Run:ai workloads must be able to access data uniformly from any worker node in order to read training data and code as well as save checkpoints, weights, and other machine learning artifacts. After completing the NVIDIA Run:ai deployment via BCM as described in this guide, data sources should be established within NVIDIA Run:ai on shared storage.

When installing atop NVIDIA DGX SuperPOD and BasePOD systems, please consult reference architecture documentation for certified storage options and documentation from your storage vendor for CSI provider and StorageClass configuration to supply high performance data sources.

System Installation

This guide refers to the use of shared storage to support Kubernetes, NVIDIA Run:ai, and certain associated required services as part of the BCM deployment assistant. Setting up that storage beforehand isn’t covered in this guide, but it’s recommended to consult NFS server vendor guidance to ensure proper configuration of exports and mount parameters for performance, resilience, and data integrity aligned to service level requirements.
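
As a sanity check before installation, confirm that the shared directory is mounted; the mount source and options will vary by environment. For example:

# Show how /cm/shared is mounted on the head node or any provisioned node
findmnt /cm/shared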

BCM Version

The instructions in this document are specific to BCM 11, with a minimum required version of 11.25.08.

Installation

  1. Before you begin with the installation, make sure you have reviewed the Installation preparation section and completed all tasks indicated in the Pre-installation checklist.

  2. Access the active BCM head node via ssh:

    ssh root@<IP address of BCM head node>
  3. Verify the BCM version:

    cm-package-release-info -f cm-setup,cmdaemon
    
    Name      Version    Release(s)
    --------  ---------  ------------
    cm-setup  123245     11.25.08
    cmdaemon  163415     11.25.08
  4. Create the following files in the /cm/shared/runai/ directory populating each respectively from the linked content:

  5. Verify that all files from the installation preparation and the step above have been created and are present:

    root@bcm11-headnode:~# ls -1 /cm/shared/runai/
    
    credential.jwt
    netop-values.yaml 
    nic-cluster-policy.yaml 
    combined-ippools-gb200.yaml
    combined-sriovibnet-gb200.yaml
    dra-test-gb200.yaml
    ib-test-gb200.yaml
    full-chain.pem
    private.key
    ca.crt # optional
  6. Run the following command to initiate deployment via an interactive command-line assistant:

    cm-kubernetes-setup
  7. Select Deploy Kubernetes installation wizard and click Ok to proceed. If cm-kubernetes-setup is being run from GB200, refer to the second screenshot:

  8. Select the relevant Kubernetes version. This guide, employing Base Command Manager 11.25.08, is based on and requires Kubernetes 1.32. Click Ok to proceed:

  9. The next step inquires if there’s a Docker Hub registry mirror available. It’s recommended that a local registry mirror be employed when available. For the purpose of this guide, leave the default value (blank) and click Ok to proceed:

  10. Insert values for the new Kubernetes cluster that NVIDIA Run:ai will be installed into. Click Ok to proceed:

    • The Kubernetes cluster name should be a short, unique name that can be used to distinguish between multiple clusters (e.g., k8s-user).

    • The k8s-user.local value for Kubernetes domain name is the default value for internal (within the Kubernetes cluster) name resolution and service discovery. It should be unique to distinguish it from the NMC cluster on DGX GB200 and later SuperPODs. Common practice is to avoid using the same domain for the internal Kubernetes domain name and the externally referenceable FQDN, to avoid potential name resolution inconsistencies.

    • The Kubernetes external FQDN field refers to the domain name that the Kubernetes API Server will be proxied at and will be automatically populated by BCM. If a valid name record (FQDN) for the BCM head node has been established beforehand, it should be entered here. Please see the reference architecture section of the BCM Containerization Manual for details on how this is implemented via an NGINX proxy.

    • The Service network base address, Service network netmask bits, Pod network base address, and Pod network netmask bits fields provide CIDR ranges for Kubernetes service and pod networks. These will be pre-populated (taking care to avoid overlapping ranges from networks known to BCM) from private, non-routable ranges.

  11. The next step asks about exposing the Kubernetes API server to the external network. Select no and click Ok to proceed:

  12. The preferred internal network is used for Kubernetes intercommunication between control plane and worker nodes. Select internalnet for the preferred internal network and click Ok to proceed:

  13. Select 3 or more Kubernetes master nodes. These should be the same nodes assigned to the control plane category. The screenshot below is for illustration only - the correct category should be k8s-system-user. See the BCM Node Categories section for more information. Click Ok to proceed:

Note

To ensure high availability and prevent a single point of failure, it is recommended to configure at least three Kubernetes master nodes in your cluster. The nodes selected at this stage will be employed to serve the needs of the control plane and should be located on CPU nodes. In contemporary Kubernetes versions, “master nodes” are referred to as control plane nodes.

  14. Select the worker node categories to operate as the Kubernetes worker nodes. The screenshot below is for illustration only - the correct categories are either dgx-gb200-k8s or dgx-b200-k8s, together with k8s-system-user. See the BCM Node Categories section for more information. Click Ok to proceed:

Note

Both the control plane nodes and the DGX nodes must be selected. Selecting the control plane nodes here allows select NVIDIA Run:ai services to run on the control plane nodes. If the cluster configuration has dedicated NVIDIA Run:ai system nodes as described in the optional Node Category section select that category here instead.

  15. Skip the selection of individual Kubernetes worker nodes (the category selected in the previous step will be used instead). Click Ok to proceed:

Note

In the combined steps 14 and 15 above, you must select from either:

  • A “node category” only (as described in this guide as k8s-system-user)

  • “Individual Kubernetes nodes” only (not generally recommended)

  • Or, a combination of both

  16. Select the nodes to deploy etcd on. Make sure to select the same three nodes as the Kubernetes control plane nodes (Step 13). Click Ok to proceed:

  17. Leave the API server proxy port and etcd spool directory at their prepopulated values (do not modify them). Click Ok to proceed:

Note

If there are multiple Kubernetes clusters being managed by BCM (such as in the case of DGX GB200 and later SuperPODs), the default proxy port value will automatically be incremented to avoid an overlap with existing clusters and may not match the screenshot.

  18. Select Calico as the Kubernetes network plugin. Click Ok to proceed:

  19. Select no and click Ok to proceed:

  20. The components selected in this screen represent those required by NVIDIA Run:ai for a self-hosted installation. Select the operator and NVIDIA Run:ai self-hosted options as depicted below. Click Ok to proceed:

  21. Provide the NVIDIA Run:ai configuration using the values below and click Ok to proceed:

    • Run:ai Registry Credentials - Enter the path to a file containing the base64-encoded NVIDIA token. Alternatively the Base64 encoded value can be pasted in directly.

    • Run:ai Control Plane Domain Name (FQDN) - Enter the Run:ai control plane’s fully qualified domain name (e.g., runai.mycorp.local). This value should be different from the FQDN entered on the first “Insert basic values” Kubernetes setup in Step 10. It should be what was used when creating certificates (and should not be the same as the BCM head node hostname).

    • Local CA Cert Path (.crt or .pem) - Path to the root CA certificate file if you are using a local CA–issued certificate (common in testing or internal environments). It’s optional if using a certificate from a public CA.

    • Domain Cert Path (.crt/.pem) - Path to the full-chain certificate for your domain (the domain’s leaf certificate followed by any intermediate certificates).

    • Domain Cert Key Path (.key) - Path to the private key that matches the domain certificate.

Note

It’s recommended to save all certificates, configuration files, and deployment artifacts into a persistent and accessible location in case of redeployment. The /cm/shared/runai/ directory referred to in this guide resides on a shared mount point and would be a suitable location. See the TLS certificates section for additional clarification.

  22. Select yes to install NVIDIA Run:ai components. Click Ok to proceed:

Note

In this version of the BCM installation assistant, a warning dialog indicating an ssh issue will follow - disregard it and click Ok to proceed. Other warnings at this stage may indicate a problem with the supplied certificates.

  23. Select the k8s-system-user node category for the NVIDIA Run:ai control plane nodes and click Ok to proceed:

  24. Select the required NVIDIA GPU Operator version (v25.3.2). Click Ok to proceed:

  25. Select the required Network Operator version (v25.4.0). Click Ok to proceed:

  26. Select the required NVIDIA Run:ai version. Click Ok to proceed:

  27. When prompted to supply a Custom YAML config for the GPU Operator, leave the default (blank) and click Ok to proceed:

  28. Configure the NVIDIA GPU Operator by selecting the following configuration parameters. Click Ok to proceed:

  29. Supply the path to the netop-values.yaml file that was created earlier. Click Ok to proceed:

  30. Click Ok on the MetalLB IP address pools page; it will automatically set up the requirements for NVIDIA Run:ai:

  31. Specify the ingress IP addresses prepared as documented in the Pre-installation checklist section. The mention of MetalLB here indicates that these will be set up as part of a load-balanced pool and assigned to each respective ingress. Click Ok to proceed:

  32. Select no when asked whether to expose the Kubernetes Ingress on the default HTTPS port. Click Ok to proceed:

  33. Leave the node ports for the Ingress NGINX Controller at the pre-populated values (do not modify them) and click Ok to proceed:

  34. Select the serving option in the Knative Operator components dialog. Click Ok to proceed:

  35. If deploying onto an A100 or H100 only cluster, select yes. If deploying on any other cluster configuration, select no. Click Ok to proceed:

Note

If applicable, Network Operator policies for DGX B200 or DGX GB200 systems will be applied in a post-deployment step described below.

  36. If yes was selected in the previous step, select the appropriate option for the cluster and click Ok to proceed. If no was selected in the previous step, this page will not appear:

  37. Select yes to install the Permission Manager. Click Ok to proceed:

Note

The BCM Permission Manager coordinates security policy, system accounts, RBAC, and configures Kubernetes to employ BCM LDAP for user accounts. BCM User Accounts, however, are not automatically represented within NVIDIA Run:ai. For assistance with configuring NVIDIA Run:ai, see Set Up SSO with OpenID Connect. For more information on the BCM Permission Manager, see Containerization Manual documentation.

  38. Select Local path as the Kubernetes StorageClass. Ensure that both enabled and default are specified. Click Ok to proceed:

Note

The indication “local path” in the installation assistant may imply that local storage is employed, but those paths are pointing to NFS mountpoints. These were mounted as part of standard BCM node provisioning (e.g. /cm/shared/home).

  39. Configure the CSI Provider (local-path-provisioner) to employ shared storage (/cm/shared/apps/kubernetes/k8s-user/var/volumes as a default). Click Ok to proceed:

  40. Select yes to enable local persistent storage for Grafana. Click Ok to proceed:

  41. Select Save config, set an accessible location for the config file (for example: /cm/shared/runai/cm-kubernetes-setup.conf) alongside the rest of the config files, and then click Ok. Select Exit and Ok to complete the wizard and return to the terminal:

The deployment process may require an extended period (60+ minutes). To prevent interruptions, failures, or network outages from disrupting the deployment, it’s recommended to perform it from a persistent terminal session such as tmux or screen.

# Start a new screen session named "install_runai" 
# This allows detach/reattaching safely during the installation
screen -S install_runai

# Inside the screen session: run the cluster setup using the configuration file 
cm-kubernetes-setup -c /cm/shared/runai/cm-kubernetes-setup.conf

Note

During the deployment process all nodes that are members of the new Kubernetes cluster will be rebooted.
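
Once the deployment finishes and the nodes have rebooted, a quick check from the head node confirms all members rejoined the new cluster (the module name below matches the cluster name chosen earlier in this guide):

# Load the new cluster's module and confirm all nodes are Ready
module load kubernetes/k8s-user
kubectl get nodes -o wide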

Connect to NVIDIA Run:ai User Interface

  1. Open your browser and go to: https://<DOMAIN>

  2. Log in using the default credentials:

You will be prompted to change the password.

Post-wizard Deployment Steps

After the BCM installation assistant completes, additional steps are required.

If multiple Kubernetes clusters are configured in this instance of BCM, load the correct Kubernetes module before running all post-wizard commands:

module unload kubernetes
module load kubernetes/k8s-user

MPI Operator

Install the MPI Operator v0.6.0 or later by running the following command:

kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml --force-conflicts

# Validate MPIJob CRD is installed
kubectl get crd mpijobs.kubeflow.org  
> NAME                   CREATED AT
> mpijobs.kubeflow.org   2025-09-10T20:57:42Z

NVIDIA Dynamic Resource Allocation (DRA) Driver

The NVIDIA DRA Driver for GPUs extends how NVIDIA GPUs are consumed within Kubernetes. This is required to enable secure Internode Memory Exchange (IMEX) on Multi-Node NVLink (MNNVL) systems (e.g. GB200 and similar) for Kubernetes workloads and should be included with all NVIDIA GPU systems.

  1. Install using Helm:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
    helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="25.3.0" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot=/ \
    --set resources.gpus.enabled=false
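    # Optionally confirm the DRA driver components came up before continuing
    # (namespace matches the --namespace value above; pod names vary by driver version)
    kubectl get pods -n nvidia-dra-driver-gpu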
  2. GB200 only - Create a file in /cm/shared/runai from dra-test-gb200.yaml and update the clique ID to match the clique ID from the cluster. Note that the following test addresses single rack NVL72 clusters. For multi-rack systems, you’ll need to adjust podAffinity (e.g. topologyKey: nvidia.com/gpu.clique).

    kubectl describe nodes | grep nvidia.com/gpu.clique=
    >                    nvidia.com/gpu.clique=f84d133c-bbc9-55fd-b1ff-ffffc7ef6783.23322
    
    # For DGX GB200 systems
    kubectl apply -f /cm/shared/runai/dra-test-gb200.yaml
  3. GB200 only - Validate the test successfully completed and inspect the logs of the launcher:

    # For GB200
    kubectl get pods
    > NAME                              READY   STATUS      RESTARTS   AGE
    > nvbandwidth-test-launcher-snb82   0/1     Completed   0          72s
    
    kubectl logs nvbandwidth-test-launcher-snb82
  4. GB200 only - Cleanup test:

    #For GB200
    kubectl delete -f dra-test-gb200.yaml

The default NVIDIA Run:ai configuration does not expose DRA features. After installing the DRA components, this can be enabled by modifying the runaiconfig in the cluster. See Advanced cluster configurations for more details:

# Edit the runaiconfig object to set GPUNetworkAccelerationEnabled
# to true and adjust tolerations for the Kubernetes control plane

kubectl patch runaiconfig runai \
  -n runai \
  --context=kubernetes-admin@k8s-user \
  --type='merge' \
  -p '{
    "spec": {
      "workload-controller": {
        "GPUNetworkAccelerationEnabled": true
      },
      "global": {
        "tolerations": [
          {
            "key": "node-role.kubernetes.io/control-plane",
            "operator": "Exists",
            "effect": "NoSchedule"
          }
        ]
      }
    }
  }'

Instructions for validating the change and reverting if necessary:

# Validate the patch was applied successfully


kubectl get runaiconfig runai \
  -n runai \
  --context=kubernetes-admin@k8s-user \
  -o custom-columns=GPUAccelEnabled:.spec.workload-controller.GPUNetworkAccelerationEnabled,Tolerations:.spec.global.tolerations

# To revert the runaiconfig object change

kubectl patch runaiconfig runai -n runai --type='merge' -p '{
  "spec": {
    "workload-controller": {
      "GPUNetworkAccelerationEnabled": false
    },
    "global": {
      "tolerations": null
    }
  }
}'

Configure the Network Operator for B200 and GB200 Systems

In version 11.25.08 of the BCM installation assistant, the Network Operator requires additional configuration on DGX B200 and GB200 SuperPOD / BasePOD systems. While the operator is installed in a preceding step, it does not automatically initialize or configure SR-IOV and secondary network plugins.

The following CRD resources have to be created in the exact order shown below:

  • SR-IOV Network Policies for each NVIDIA InfiniBand NIC

  • An nvIPAM IP address pool

  • SR-IOV InfiniBand networks

  1. Create SR-IOV network node policies using the nic-cluster-policy.yaml that was created in an earlier step:

    kubectl apply -f /cm/shared/runai/nic-cluster-policy.yaml
  2. Create an IPAM IP Pool using the respective combined-ippools-gb200.yaml or combined-ippools-b200.yaml that were created in an earlier step:

    kubectl apply -f /cm/shared/runai/combined-ippools-gb200.yaml
  3. Create the SR-IOV IB networks using the respective combined-sriovibnet-gb200.yaml or combined-sriovibnet-b200.yaml that were created in an earlier step:

    kubectl apply -f /cm/shared/runai/combined-sriovibnet-gb200.yaml

Note

You may need to modify the interface names for non-DGX systems.

  4. Create the SR-IOV node pool configuration using the sriov-node-pool-config.yaml:

    kubectl apply -f /cm/shared/runai/sriov-node-pool-config.yaml

Note

This will typically reconfigure NICs and may result in a node reboot. The supplied YAML sets the maxUnavailable field to 20%. This value should be adjusted to align with your operational requirements. A value of 1 would have the effect of serializing the upgrade and would result in blocking upon a single node failure. It may be appropriate for a small lab deployment to set it to 100%. This would prevent any single machine failure from blocking the remaining nodes from upgrading. For larger clusters, setting the value to a lower percentage means that the upgrade process will be effectively split into batches.

  5. Validate by describing one of the DGX nodes and checking for SR-IOV devices:

    # Describe a DGX worker node
    kubectl describe node <dgx-node> --context=kubernetes-admin@k8s-user | grep sriovib
    
    
    # Example output
      nvidia.com/sriovib_resource_a:  16
      nvidia.com/sriovib_resource_b:  16
      nvidia.com/sriovib_resource_c:  16
      nvidia.com/sriovib_resource_d:  16
      nvidia.com/sriovib_resource_a:  16
      nvidia.com/sriovib_resource_b:  16
      nvidia.com/sriovib_resource_c:  16
      nvidia.com/sriovib_resource_d:  16
    nvidia.com/sriovib_resource_b
    
    # Check the state of SR-IOV Nodes
    kubectl get -n network-operator sriovnetworknodestate --context=kubernetes-admin@k8s-user
    # Example Output
    NAME        SYNC STATUS
    <dgx_node>   Succeeded

Note

It might take several minutes for these settings to take effect. If the sriovnetworkconfig daemon changes the NIC config, then a node reboot will occur.

  6. Validate by running the DGX SuperPOD platform-specific tests:

    1. For GB200 - ib-test-gb200.yaml:

      #DGX GB200
      
      kubectl apply -f /cm/shared/runai/ib-test-gb200.yaml -n default
      
      MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
      
      CUDA Runtime Version: 12080
      
      CUDA Driver Version: 12080
      
      Driver Version: 570.172.08
      
      Process 0 (nvbandwidth-test-worker-0): device 0: NVIDIA GB200 (00000008:01:00)
      
      Process 1 (nvbandwidth-test-worker-0): device 1: NVIDIA GB200 (00000009:01:00)
      
      Process 2 (nvbandwidth-test-worker-0): device 2: NVIDIA GB200 (00000018:01:00)
      
      Process 3 (nvbandwidth-test-worker-0): device 3: NVIDIA GB200 (00000019:01:00)
      
      Process 4 (nvbandwidth-test-worker-1): device 0: NVIDIA GB200 (00000008:01:00)
      
      Process 5 (nvbandwidth-test-worker-1): device 1: NVIDIA GB200 (00000009:01:00)
      
      Process 6 (nvbandwidth-test-worker-1): device 2: NVIDIA GB200 (00000018:01:00)
      
      Process 7 (nvbandwidth-test-worker-1): device 3: NVIDIA GB200 (00000019:01:00)
      
      Running host_to_device_memcpy_ce.
      
      memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
                 0         1         2         3
       0     85.59     95.33    200.73    191.27
      
      SUM host_to_device_memcpy_ce 572.93
    2. For B200 - ib-test-b200.yaml:

      # DGX B200
      kubectl apply -f /cm/shared/runai/ib-test-b200.yaml -n default
      
      kubectl logs -n default nccl-test-launcher-hdm54 
      Warning: Permanently added '[nccl-test-worker-0.nccl-test.default.svc]:2222' (ED25519) to the list of known hosts.
      Warning: Permanently added '[nccl-test-worker-1.nccl-test.default.svc]:2222' (ED25519) to the list of known hosts.
      # nThread 1 nGpus 1 minBytes 16 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
      #
      # Using devices
      #  Rank  0 Group  0 Pid     46 on nccl-test-worker-0 device  0 [0x1b] NVIDIA B200
      #  Rank  1 Group  0 Pid     47 on nccl-test-worker-0 device  1 [0x43] NVIDIA B200
      #  Rank  2 Group  0 Pid     48 on nccl-test-worker-0 device  2 [0x52] NVIDIA B200
      #  Rank  3 Group  0 Pid     49 on nccl-test-worker-0 device  3 [0x61] NVIDIA B200
      #  Rank  4 Group  0 Pid     50 on nccl-test-worker-0 device  4 [0x9d] NVIDIA B200
      #  Rank  5 Group  0 Pid     52 on nccl-test-worker-0 device  5 [0xc3] NVIDIA B200
      #  Rank  6 Group  0 Pid     55 on nccl-test-worker-0 device  6 [0xd1] NVIDIA B200
      #  Rank  7 Group  0 Pid     59 on nccl-test-worker-0 device  7 [0xdf] NVIDIA B200
      #  Rank  8 Group  0 Pid     46 on nccl-test-worker-1 device  0 [0x1b] NVIDIA B200
      #  Rank  9 Group  0 Pid     47 on nccl-test-worker-1 device  1 [0x43] NVIDIA B200
      #  Rank 10 Group  0 Pid     48 on nccl-test-worker-1 device  2 [0x52] NVIDIA B200
      #  Rank 11 Group  0 Pid     49 on nccl-test-worker-1 device  3 [0x61] NVIDIA B200
      #  Rank 12 Group  0 Pid     50 on nccl-test-worker-1 device  4 [0x9d] NVIDIA B200
      #  Rank 13 Group  0 Pid     51 on nccl-test-worker-1 device  5 [0xc3] NVIDIA B200
      #  Rank 14 Group  0 Pid     54 on nccl-test-worker-1 device  6 [0xd1] NVIDIA B200
      #  Rank 15 Group  0 Pid     58 on nccl-test-worker-1 device  7 [0xdf] NVIDIA B200
      #
      #                                                              out-of-place                       in-place          
      #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
      #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
                16             4     float     sum      -1    38.60    0.00    0.00      0    49.18    0.00    0.00      0
                32             8     float     sum      -1    42.23    0.00    0.00      0    50.96    0.00    0.00      0
                64            16     float     sum      -1    47.44    0.00    0.00      0    42.18    0.00    0.00      0
               128            32     float     sum      -1    39.51    0.00    0.01      0    42.78    0.00    0.01      0
               256            64     float     sum      -1    40.16    0.01    0.01      0    43.35    0.01    0.01      0
               512           128     float     sum      -1    39.22    0.01    0.02      0    44.53    0.01    0.02      0
              1024           256     float     sum      -1    42.81    0.02    0.04      0    44.47    0.02    0.04      0
              2048           512     float     sum      -1    40.63    0.05    0.09      0    52.70    0.04    0.07      0
              4096          1024     float     sum      -1    46.76    0.09    0.16      0    52.63    0.08    0.15      0
              8192          2048     float     sum      -1    47.22    0.17    0.33      0    53.69    0.15    0.29      0
             16384          4096     float     sum      -1    49.02    0.33    0.63      0    50.96    0.32    0.60      0
             32768          8192     float     sum      -1    54.24    0.60    1.13      0    53.88    0.61    1.14      0
             65536         16384     float     sum      -1    59.05    1.11    2.08      0    59.53    1.10    2.06      0
            131072         32768     float     sum      -1    62.04    2.11    3.96      0    63.99    2.05    3.84      0
            262144         65536     float     sum      -1    106.4    2.46    4.62      0    103.1    2.54    4.77      0
            524288        131072     float     sum      -1    107.5    4.88    9.15      0    102.8    5.10    9.56      0
           1048576        262144     float     sum      -1    108.8    9.64   18.07      0    106.6    9.83   18.44      0
           2097152        524288     float     sum      -1    112.7   18.60   34.88      0    106.6   19.67   36.88      0
           4194304       1048576     float     sum      -1    118.2   35.49   66.54      0    116.6   35.97   67.44      0
           8388608       2097152     float     sum      -1    150.2   55.85  104.72      0    153.8   54.54  102.26      0
          16777216       4194304     float     sum      -1    187.5   89.46  167.73      0    188.1   89.19  167.23      0
          33554432       8388608     float     sum      -1    250.5  133.97  251.20      0    251.6  133.35  250.02      0
          67108864      16777216     float     sum      -1    395.9  169.52  317.86      0    395.1  169.87  318.50      0
         134217728      33554432     float     sum      -1    618.9  216.85  406.59      0    620.8  216.20  405.37      0
         268435456      67108864     float     sum      -1   1073.4  250.08  468.90      0   1074.2  249.89  468.54      0
         536870912     134217728     float     sum      -1   1977.2  271.53  509.13      0   1976.0  271.69  509.42      0
        1073741824     268435456     float     sum      -1   3713.5  289.14  542.15      0   3710.3  289.40  542.62      0
        2147483648     536870912     float     sum      -1   7245.1  296.40  555.76      0   7226.3  297.18  557.20      0
        4294967296    1073741824     float     sum      -1    14049  305.71  573.20      0    13939  308.13  577.75      0
        8589934592    2147483648     float     sum      -1    27360  313.97  588.68      0    27292  314.74  590.13      0
       17179869184    4294967296     float     sum      -1    53941  318.49  597.17      0    53953  318.43  597.05      0
      # Out of bounds values : 0 OK
      # Avg bus bandwidth    : 168.649 
      
      
      # Clean up after validating via ib-test-b200.yaml
      kubectl delete -f /cm/shared/runai/ib-test-b200.yaml -n default 

Note

The Network Operator will restart the DGX nodes if the number of Virtual Functions in the SR-IOV Network Policy file does not match the NVIDIA/Mellanox firmware configuration.

Apply Security Policies - Optional

By default, the BCM Kubernetes deployment has permissive security policies to ease use in development environments. For production clusters or secure environments, it’s recommended to take additional steps to harden the cluster, such as configuring the Permission Manager, applying Kyverno policies, and applying Calico policies.

For deployments of NVIDIA Run:ai as a part of NVIDIA Mission Control, please reach out to your NVIDIA representative for the latest example configurations and suggested policies. The Mission Control software installation guide’s Kubernetes Security Hardening documentation provides guidance for application.

Create Node Pools - Optional

See Node pools to create and manage groups of nodes (either by predefined node label or administrator-defined node labels). This optional configuration step can be used for advanced deployment scenarios to allocate different resources across teams or projects.

Add Additional Users - Optional

See Users for steps on adding additional users beyond the initial [email protected] account or connecting SSO.

Install the NVIDIA Run:ai Command-line - Optional

To obtain the command line binary, see the Install and configure CLI section.

Test the command line tool installation

Validate the installation by running the following command:

runai version

Note

If NVIDIA Run:ai had previously been installed via BCM, it may be necessary to update the command line version.

Set the Control Plane URL

The following step is required for Windows users only. Linux and Mac clients are configured via the installation script.

Run the following command (substituting the NVIDIA Run:ai control plane FQDN value specified in previous steps) to create the config.json file in the default path:

runai config set --cp-url runai.mycorp.local

Alternatively, the Base Command Manager installation assistant can generate this configuration.

Validate NVIDIA Run:ai

To validate the installation, please refer to the quick start guides for deploying single-GPU training jobs, multi-node training jobs, single-GPU inference jobs, and multi-GPU inference jobs. Certain NGC workloads may require adding NGC API keys and docker credentials into the cluster.

  1. Validate the ingress IP for NVIDIA Run:ai inference is configured, EXTERNAL-IP should have the value configured in the prior MetalLB steps:

    kubectl get svc -n knative-serving kourier -o wide
    
    NAME      TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                      AGE
    kourier   LoadBalancer   x.x.x.x          10.1.1.26      80:31038/TCP,443:30783/TCP   8h
  2. Validate distributed training workloads, see Run your first distributed training workload:

    # Example command
    runai training mpi submit distributed-training \
      -g 4 \
      -p training \
      --node-pools nvl72rackb06 \
      -i ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163 \
      --workers 2 \
      --slots-per-worker 4 \
      --run-as-uid 1000 \
      --ssh-auth-mount-path /home/mpiuser/.ssh \
      --clean-pod-policy Running \
      --master-command mpirun \
      --master-args "--bind-to core --map-by ppr:4:node -np 8 --report-bindings -q nvbandwidth -t multinode_device_to_device_memcpy_read_ce" \
      --command -- /usr/sbin/sshd -De -f /home/mpiuser/.sshd_config
    
  3. Validate distributed inference workloads, see Run your first custom inference workload:

    {
        "name": "distributed-vllm",
        "projectId": "4501034",
        "clusterId": "c7cd67df-c309-45ac-9056-5a04d074617d",
        "spec": {
            "workers": 1,
            "replicas": 1,
            "servingPort": {
                "port": 8000,
                "exposedUrl": "http://vllm.infernece-calorado.runailabs-ps.com/"
            },
            "leader": {
                "image": "vllm/vllm-openai:latest-aarch64",
                "command": "sh -c \"bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); python3 -m vllm.entrypoints.openai.api_server --port 8000 --model meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 4 --pipeline_parallel_size 2\"",
                "environmentVariables": [
                    {
                        "name": "NCCL_MNNVL_ENABLE",
                        "value": "0"
                    },
                    {
                        "name": "HF_TOKEN",
                        "value": "hf_xxx"
                    }
                ],
                "compute": {
                    "largeShmRequest": true,
                    "gpuDevicesRequest": 4
                }
            },
            "worker": {
                "image": "vllm/vllm-openai:latest-aarch64",
                "command": "sh -c \"bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)\"",
                 "environmentVariables": [
                    {
                        "name": "NCCL_MNNVL_ENABLE",
                        "value": "0"
                    },
                    {
                        "name": "HF_TOKEN",
                        "value": "<HF_TOKEN>"
                    }
                ],
               "compute": {
                    "largeShmRequest": true,
                    "gpuDevicesRequest": 4
                }
            }
        }
    }
    

Troubleshooting Common Issues

Slow installation

Provide a registry mirror when requested in the wizard. If one isn’t available, authenticated access to Docker Hub can avoid potential rate limiting for at least some of the artifact pulls:

# Authenticate to Docker Hub prior to running cm-kubernetes-setup
echo -n "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USER" 
--password-stdin
Delayed responsiveness from the cmsh command

If encountering slow response when running the cmsh command, try using the cmsh-lazy-load command (substituting it for cmsh wherever referenced in the documentation).

# Example: use of cmsh-lazy-load as substitute for cmsh
cmsh-lazy-load -c "device list; quit"
Failed installation

If encountering issues with installation failure (which should be evident immediately), ensure that the DGX node kernel parameters are not inadvertently forcing cgroup v1 instead of cgroup v2:

# the following kernel parameters should not be present 
systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller
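
To check whether those parameters are present, inspect the running kernel command line on a DGX node, for example:

# Inspect the kernel command line for cgroup-related parameters
cat /proc/cmdline | tr ' ' '\n' | grep -i cgroup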
Shared Storage (NFS) configuration

If encountering issues indicating problems consistently accessing Persistent Volumes (PVs), ensure that NFSv3 is used for /cm/shared for both of the node categories used in this guide. For example (please substitute the category name as appropriate for the DGX type):

# Force NFSv3 for the worker node category
cmsh -c "category use dgx-gb200-k8s; fsmounts; use /cm/shared; set mountoptions defaults,_netdev,vers=3; commit; quit"

# Force NFSv3 on the CPU nodes
cmsh -c "category use k8s-system-user; fsmounts; use /cm/shared; set mountoptions defaults,_netdev,vers=3; commit; quit"
MetalLB Load Balancer manual installation

Since CPU nodes are shared for the combined control plane elements in this architecture, BCM configures MetalLB and adjusts node labels so that it can run on those nodes. The following would be required as a manual step when deploying MetalLB in this manner:

# Remove the exclusion preventing nodes from receiving load balancer traffic
kubectl label nodes --all node.kubernetes.io/exclude-from-external-load-balancers-

Note

The above is not required when using the BCM installation assistant. It’s included here to assist with alternative deployment approaches on DGX SuperPOD / BasePOD.

NVIDIA Run:ai exact version selection

The BCM installation assistant will pull the latest NVIDIA Run:ai patch release available for the minor version selected. The following can be used to indicate which version will be installed:

helm repo add runai https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm search repo runai/control-plane --versions | grep "2.22"

runai-backend/control-plane	2.22.48      	2.22.48    	Run:ai Control Plane

TLS Certificates

Certificate Categories

The four main categories of certificates that may apply here:

  • Public (globally trusted) certificates - These are issued by well-known Certificate Authorities (CAs) and are recognized automatically by most operating systems, browsers, and clients because the issuing CA’s root certificate is already included in default trust stores. Public certificates are used for services that must be trusted without additional client configuration:

    # List default pre-installed root CA certificates
    
    dpkg -L ca-certificates | grep '\.crt$'
  • Internal (organization-issued) certificates - Issued by a corporate or internal CA (e.g. a managed PKI, or an internal CA cluster). These are “real” X.509 certificates, but the issuing CA is not part of the global trust chain. They are trusted only within the organization once the internal root CA certificate is distributed to hosts and clusters.

  • Local CA–issued certificates - These are “real” certificates generated by first creating a private Certificate Authority (CA) root, then using it to sign individual server or client certificates. Clients only need the CA’s root certificate installed once, after which all certificates issued by that CA are trusted automatically. These are typically used for quick (short-lived) testing or isolated scenarios but are not recommended for production since they are not globally trusted or automatically distributed to clients. Additionally, they create a management and process burden on the creator for common maintenance (e.g. rotation and revocation tasks).

  • Local (self-signed) certificates (not supported) - Self-signed certificates are unique because each one is created and signed by itself, acting as its own trust anchor. Unlike certificates issued by a shared CA, there’s no common root of trust. Every certificate must be explicitly installed and trusted on every client that needs to connect to a given service. In a distributed environment with many services (such as with NVIDIA Run:ai on SuperPOD), this means that you would need to distribute and manage dozens of separate certificates across all nodes and clients. Self-signed certificates are explicitly unsupported and should not be used with NVIDIA Run:ai.

Certificate File Formats

Regardless of how a certificate is issued, the encoding is the same: X.509 Base64-encoded in PEM format. File extensions such as .pem, .crt, and .cer are technically interchangeable in this context so long as the contents are properly encoded, but conventionally .crt is used for certificates, .key for private keys, and .pem for generic containers or chains.

Ensure you are working with PEM-encoded certificates by examining them to make sure they resemble the following:

# Example (not valid for use, from openssl project test certs)

less /cm/shared/runai/ca.crt

-----BEGIN CERTIFICATE-----
MIIDATCCAemgAwIBAgIBATANBgkqhkiG9w0BAQsFADASMRAwDgYDVQQDDAdSb290
IENBMCAXDTIwMTIxMjIwMDk0OVoYDzIxMjAxMjEzMjAwOTQ5WjASMRAwDgYDVQQD
DAdSb290IENBMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA4eYA9Qa8
oEY4eQ8/HnEZE20C3yubdmv8rLAh7daRCEI7pWM17FJboKJKxdYAlAOXWj25ZyjS
feMhXKTtxjyNjoTRnVTDPdl0opZ2Z3H5xhpQd7P9eO5b4OOMiSPCmiLsPtQ3ngfN
wCtVERc6NEIcaQ06GLDtFZRexv2eh8Yc55QaksBfBcFzQ+UD3gmRySTO2I6Lfi7g
MUjRhipqVSZ66As2Tpex4KTJ2lxpSwOACFaDox+yKrjBTP7FsU3UwAGq7b7OJb3u
aa32B81uK6GJVPVo65gJ7clgZsszYkoDsGjWDqtfwTVVfv1G7rrr3Laio+2Ff3ff
tWgiQ35mJCOvxQIDAQABo2AwXjAPBgNVHRMBAf8EBTADAQH/MAsGA1UdDwQEAwIB
BjAdBgNVHQ4EFgQUjvUlrx6ba4Q9fICayVOcTXL3o1IwHwYDVR0jBBgwFoAUjvUl
rx6ba4Q9fICayVOcTXL3o1IwDQYJKoZIhvcNAQELBQADggEBAL2sqYB5P22c068E
UNoMAfDgGxnuZ48ddWSWK/OWiS5U5VI7R/c8vjOCHU1OI/eQfhOenXxnHNF2QBuu
bjdg5ImPsvgQNFs6ZUgenQh+E4JDkTpn7bKCgtK7qlAPUXZRZI6uAaH5zKu3yFPU
2kow3LFCwYutrSfVg6JYeX+cuYsLHFzNzOhqh88Mu9yJ7pPJ8faeHFglHa51eoaw
vurAVknk7tzUxLZN0PxD9nrduVwtiluFbCPz0EtP5Dt1KylGdPrKvCJNkFkRJX+S
0t9VNIhyqLmslP5uSFtuTt8toXkizaYlxIVHckkvpuKZB8m7l8C/lom9sqagjZ1J
If+teEc=
-----END CERTIFICATE-----

If you receive a file in another format (e.g., binary .cer), consult openssl instructions to convert it to PEM before using.
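
For example, a binary (DER-encoded) certificate can typically be converted to PEM with openssl (file names here are illustrative):

# Convert a DER-encoded certificate to PEM
openssl x509 -inform DER -in certificate.cer -out certificate.pem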

Certificate Chain Components

Certificates typically exist as part of a chain of trust:

  • Leaf (server/client) certificate - The certificate presented by a service (e.g., runai.mycorp.local). It proves the service’s identity but cannot usually be validated on its own.

  • Intermediate certificate(s) - Issued by a root or higher-level CA. These form the link between the root and the leaf. Many public CAs issue one or more intermediates for operational security.

  • Root CA certificate - The ultimate trust anchor. Public roots are distributed in operating system/browser trust stores; internal or local roots must be installed manually in those stores.

  • Full chain (fullchain.pem) - A bundle containing the leaf certificate followed by its intermediate certificate(s). This file is often required by web servers (NGINX, Apache, Kubernetes ingress) so clients receive the complete trust path from the service back to a trusted root.
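
When the CA delivers the leaf and intermediate certificates as separate PEM files, the full chain can simply be concatenated in that order (file names are illustrative):

# Leaf certificate first, then intermediate(s)
cat runai.crt intermediate.crt > /cm/shared/runai/full-chain.pem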

Subject Alternative Names (SANs)

Subject Alternative Names (SANs) extend a certificate beyond a single Common Name (CN) by allowing multiple hostnames, domains, or IP addresses to be listed as valid identities. Modern TLS clients validate the SAN field rather than the CN, making it essential for compatibility and trust. When issuing certificates from a local CA, SANs should always be included so that services can be securely accessed under all expected names (e.g., runai.mycorp.local, runai, and 10.1.1.25).
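
As a minimal sketch of what issuing a SAN-bearing certificate from a local CA can look like with openssl (OpenSSL 3.x syntax assumed; names, lifetimes, and the example FQDN/IP are illustrative only):

# 1. Create a private CA (key and self-signed root certificate)
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout rootCA.key -out rootCA.pem -subj "/CN=My Local Root CA"

# 2. Create a key and signing request for the control plane FQDN, including SANs
openssl req -newkey rsa:4096 -nodes -keyout private.key -out runai.csr \
  -subj "/CN=runai.mycorp.local" \
  -addext "subjectAltName=DNS:runai.mycorp.local,DNS:*.runai.mycorp.local,IP:10.1.1.25"

# 3. Sign the request with the local CA, copying the SAN extension from the CSR
openssl x509 -req -in runai.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial \
  -days 365 -sha256 -copy_extensions copyall -out runai.crt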
