NVIDIA Dynamo Disaggregated Inference Deployment

This tutorial demonstrates how to run a multi-node, disaggregated inference workload using the RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic model with NVIDIA Dynamo, vLLM, and NVIDIA Grove on the NVIDIA Run:ai platform. You can use this workflow as a reference and adapt it for your own models, container images, and hardware configurations.

Note

Verify the specific version of Dynamo is supported on your target hardware according to the Dynamo support matrix.

In this tutorial, you will learn how to:

Set up environment prerequisites for Dynamo-based inference workload
Create a user access key for API integrations with NVIDIA Run:ai
Submit and run a multi-node inference workload through the NVIDIA Run:ai user interface, API, or CLI
Access the inference endpoint to send requests

Prerequisites

Cluster Prerequisites

Ensure that the NVIDIA Dynamo platform is installed in the cluster. This includes the required Custom Resource Definitions (CRDs) and Dynamo platform components.

The following steps must be performed by your administrator.

Set environment variables:

export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.7.1

Install CRDs:

helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

Install the Dynamo platform:

helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set "grove.enabled=true"

NVIDIA Run:ai Prerequisites

Before you start, make sure the following requirements are met:

Your administrator has:
- Created a project for you.
- Enabled network topology-aware scheduling in the relevant node pool, allowing the NVIDIA Run:ai Scheduler to place distributed pods close to each other in the network for optimal performance. For more details, see Accelerating workloads with network topology-aware scheduling.
- For GB200 NVL72 and Multi-Node NVLink (MNNVL) systems:
  - Enabled GB200 NVL72 and Multi-Node NVLink (MNNVL) support in the relevant node pool, ensuring that distributed workloads can leverage high-bandwidth interconnects and topology-aware scheduling. For more details, see Using GB200 NVL72 and Multi-Node NVLink domains.
  - Enabled GPU network acceleration in the cluster configuration by setting the appropriate flag, anyworkload-controller.GPUNetworkAccelerationEnabled=true. For details on how to configure this value using Helm or runaiconfig, see Advanced cluster configurations.
You have:
- Access to the model if you are using a gated Hugging Face model that requires an HF_TOKEN. Generate a token from your Hugging Face account and provide it to the workload as an environment variable. The model used in this tutorial is not gated and does not require authentication.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Step 2: Creating a User Access Key

Access keys are used for API integrations with NVIDIA Run:ai. An access key contains a client ID and a client secret. With the client credentials, you can obtain a token and use it within subsequent API calls.

In the NVIDIA Run:ai user interface:

Click the user avatar at the top right corner, then select Settings
Click +ACCESS KEY
Enter the access key's name and click CREATE
Copy the Client ID and Client secret and store securely
Click DONE

To request an API access token, use the client credentials to get a token to access NVIDIA Run:ai using the Tokens API. For example:

curl -X POST \ 
  # Replace <COMPANY_URL> below with:
  # For SaaS, use <tenant-name>.run.ai
  # For self-hosted use the NVIDIA Run:ai user interface URL.
  'https://<COMPANY_URL>/api/v1/token' \ 
  --header 'Accept: */*' \ 
  --header 'Content-Type: application/json' \ 
  --data-raw '{ 
  "grantType":"client_credentials", 
  "clientId":"<CLIENT ID>", 
  "clientSecret" : "<CLIENT SECRET>" 
}'

Step 3: Creating a PVC Data Source

To reduce cold-start latency and cache model weights across runs, create a data source in the form of a Persistent Volume Claim (PVC). The PVC can be mounted to workloads and will persist after the workload completes, allowing any data it contains to be reused.

Note

The first time a workload is launched using a new PVC, it will take longer to start as the storage gets provisioned only once the first claim to the PVC is made.

To create a PVC, go to Workload manager → Data sources.
Click +NEW DATA SOURCE and select PVC from the dropdown menu.
Within the new form, set the scope.

Important

PVC data sources created at the cluster or department level do not replicate data across projects or namespaces. Each project or namespace will be provisioned as a separate PVC replica with different underlying PVs; therefore, the data in each PVC is not replicated.

Enter a name for the data source. The name must be unique.
For the data options, select New PVC and the storage class that suits your needs:
- To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode.
- Enter 2 TB for the claim size to ensure we have plenty of capacity for future workloads
- Select Filesystem (default) as the volume mode. The volume will be mounted as a filesystem, enabling the usage of directories and files.
- Set the Container path to /opt/models which is where the PVC will be mounted inside containers.
Click CREATE DATA SOURCE

After creating the data source, wait for the PVC to be provisioned. The PVC claim name (which is displayed in the UI as the Kubernetes name) will appear in the Data sources grid once it’s ready. This claim name is the exact value that will be used for the <pvc-claim-name> when submitting the workload.

Copy the following command to your terminal. Make sure to update the following parameters:

<COMPANY-URL> - The link to the NVIDIA Run:ai user interface.
<TOKEN> - The API access token obtained in Step 2.

For all other parameters within the JSON body, refer to the PVC API.

After creating the data source, wait for the PVC to be provisioned. Use the List PVC asset API to retrieve the claim name. This claim name is the exact value that will be used for the <pvc-claim-name> when submitting the workload.

curl -L 'https://<COMPANY-URL>/api/v1/asset/datasource/pvc' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "meta": {
      "name": "<pvc-name>",
      "scope": "<scope>"
    },
    "spec": {
      "path": "/opt/models",
      "existingPvc": false,
      "claimInfo": {
        "size": "2TB",
        "storageClass": "<my-storage-class>",
        "accessModes": {
          "readWriteMany": true
        },
        "volumeMode": "Filesystem"
      }
    }
  }'

Step 4: Creating the Workload

The configuration below submits a DynamoGraphDeployment that defines a disaggregated, multi-node inference pipeline composed of multiple Dynamo services, each responsible for a distinct stage of inference execution.

How the Configuration Works

Defines a Dynamo-based inference workload that is registered and managed by NVIDIA Run:ai.
Uses vLLM as the backend framework for model execution.
Deploys a single frontend service that exposes an HTTP endpoint for inference requests.
Deploys two worker services:
- VllmPrefillWorker for the prefill stage
- VllmDecodeWorker for the decode stage
Configures the workload as a disaggregated, multi-node deployment, where prefill and decode run as independent worker services and can be scheduled on different nodes.
Each worker runs as one replica and requests 8 GPUs per replica, enabling parallel execution across stages.
Configures shared memory (80 Gi) for each worker to support performance-sensitive inference execution.

For more details, see the NVIDIA Dynamo documentation.

Submitting the Workload

To create a workload, go to Workload manager → Workloads.
Click + NEW WORKLOAD and select Via YAML from the dropdown.
In the YAML submission form, select the cluster where the workload will run.
Upload or paste your YAML manifest. To upload a file, click UPLOAD YAML FILE and choose your YAML. To paste the YAML, insert it directly into the editor. Before submitting, update the following fields in the manifest:
- <pvc-claim-name> - The claim name associated with the PVC created in Step 3.
- <your-hf-token> - Required only for gated Hugging Face models. Generate a token from your Hugging Face account and provide it to the workload as an environment variable.

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama3-70b-disagg-mn
spec:
  backendFramework: vllm
  pvcs:
    - name: <pvc-claim-name>
      create: false
  services:
    Frontend:
      componentType: frontend
      dynamoNamespace: llama3-70b-disagg-mn
      volumeMounts:
        - name: <pvc-claim-name>
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1
          workingDir: /workspace/examples/backends/vllm
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: HF_TOKEN
          value: <your-hf-token>
      replicas: 1
    VllmPrefillWorker:
      componentType: worker
      dynamoNamespace: llama3-70b-disagg-mn
      volumeMounts:
        - name: <pvc-claim-name>
          mountPoint: /opt/models
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        mainContainer:
          env:
            - name: SERVED_MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: MODEL_PATH
              value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
            - name: HF_HOME
              value: /opt/models
            - name: HF_TOKEN
              value: <your-hf-token>
          args:
          - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --disable-log-requests --is-prefill-worker --gpu-memory-utilization 0.95 --no-enable-prefix-caching --block-size 128"
          command:
          - /bin/sh
          - -c
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1
          workingDir: /workspace/examples/backends/vllm
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"
    VllmDecodeWorker:
      componentType: worker
      dynamoNamespace: llama3-70b-disagg-mn
      volumeMounts:
        - name: <pvc-claim-name>
          mountPoint: /opt/models
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        mainContainer:
          env:
            - name: SERVED_MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: MODEL_PATH
              value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
            - name: HF_HOME
              value: /opt/models
            - name: HF_TOKEN
              value: <your-hf-token>
          args:
          - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --disable-log-requests --gpu-memory-utilization 0.90 --no-enable-prefix-caching --block-size 128"
          command:
          - /bin/sh
          - -c
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1
          workingDir: /workspace/examples/backends/vllm
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"

Select a project. If the namespace is not defined in the YAML, select a project from the submission form.
Click CREATE WORKLOAD

Copy the following command to your terminal. Make sure to update the following parameters. For more details, see Workload V2 API:

<COMPANY-URL> - The link to the NVIDIA Run:ai user interface.
<TOKEN> - The API access token obtained in Step 2.
<PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
<pvc-claim-name> - The claim name associated with the PVC created in Step 3.
<your-hf-token> - Required only for gated Hugging Face models. Generate a token from your Hugging Face account and provide it to the workload as an environment variable.

curl -L 'https://<COMPANY-URL>/api/v2/workloads' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "metadata": {
      "name": "llama3-disagg-mn",
      "projectId": "<PROJECT-ID>"
    },
    "manifest": {
      "apiVersion": "nvidia.com/v1alpha1",
      "kind": "DynamoGraphDeployment",
      "metadata": {
        "name": "llama3-70b-disagg-mn"
      },
      "spec": {
        "backendFramework": "vllm",
        "pvcs": [
          {
            "name": "<pvc-claim-name>",
            "create": false
          }
        ],
        "services": {
          "Frontend": {
            "componentType": "frontend",
            "dynamoNamespace": "llama3-70b-disagg-mn",
            "volumeMounts": [
              {
                "name": "<pvc-claim-name>",
                "mountPoint": "/opt/models"
              }
            ],
            "extraPodSpec": {
              "mainContainer": {
                "image": "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1",
                "workingDir": "/workspace/examples/backends/vllm"
              }
            },
            "envs": [
              {
                "name": "HF_HOME",
                "value": "/opt/models"
              },
              {
              "name": "HF_TOKEN",
              "value": "<your-hf-token>"
              }
            ],
            "replicas": 1
          },
          "VllmPrefillWorker": {
            "componentType": "worker",
            "dynamoNamespace": "llama3-70b-disagg-mn",
            "volumeMounts": [
              {
                "name": "<pvc-claim-name>",
                "mountPoint": "/opt/models"
              }
            ],
            "sharedMemory": {
              "size": "80Gi"
            },
            "extraPodSpec": {
              "mainContainer": {
                "env": [
                  {
                    "name": "SERVED_MODEL_NAME",
                    "value": "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
                  },
                  {
                    "name": "MODEL_PATH",
                    "value": "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
                  },
                  {
                    "name": "HF_HOME",
                    "value": "/opt/models"
                  },
                  {
                    "name": "HF_TOKEN",
                    "value": "<your-hf-token>"
                  }
                ],
                "args": [
                  "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --disable-log-requests --is-prefill-worker --gpu-memory-utilization 0.80 --no-enable-prefix-caching --block-size 128"
                ],
                "command": [
                  "/bin/sh",
                  "-c"
                ],
                "image": "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1",
                "workingDir": "/workspace/examples/backends/vllm"
              }
            },
            "replicas": 1,
            "resources": {
              "limits": {
                "gpu": "8"
              },
              "requests": {
                "gpu": "8"
              }
            }
          },
          "VllmDecodeWorker": {
            "componentType": "worker",
            "dynamoNamespace": "llama3-70b-disagg-mn",
            "volumeMounts": [
              {
                "name": "<pvc-claim-name>",
                "mountPoint": "/opt/models"
              }
            ],
            "sharedMemory": {
              "size": "80Gi"
            },
            "extraPodSpec": {
              "mainContainer": {
                "env": [
                  {
                    "name": "SERVED_MODEL_NAME",
                    "value": "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
                  },
                  {
                    "name": "MODEL_PATH",
                    "value": "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
                  },
                  {
                    "name": "HF_HOME",
                    "value": "/opt/models"
                  },
                  {
                    "name": "HF_TOKEN",
                    "value": "<your-hf-token>"
                  }
                ],
                "args": [
                  "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --disable-log-requests --gpu-memory-utilization 0.80 --no-enable-prefix-caching --block-size 128"
                ],
                "command": [
                  "/bin/sh",
                  "-c"
                ],
                "image": "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1",
                "workingDir": "/workspace/examples/backends/vllm"
              }
            },
            "replicas": 1,
            "resources": {
              "limits": {
                "gpu": "8"
              },
              "requests": {
                "gpu": "8"
              }
            }
          }
        }
      }
    }
  }'

Follow the below steps. For more details, see CLI reference:

Save the following YAML file. Before submitting, update the following fields in the manifest:

<pvc-claim-name> - The claim name associated with the PVC created in Step 3.
<your-hf-token> - Required only for gated Hugging Face models. Generate a token from your Hugging Face account and provide it to the workload as an environment variable.

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama3-70b-disagg-mn
spec:
  backendFramework: vllm
  pvcs:
    - name: <pvc-claim-name>
      create: false
  services:
    Frontend:
      componentType: frontend
      dynamoNamespace: llama3-70b-disagg-mn
      volumeMounts:
        - name: <pvc-claim-name>
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1
          workingDir: /workspace/examples/backends/vllm
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: HF_TOKEN
          value: <your-hf-token>
      replicas: 1
    VllmPrefillWorker:
      componentType: worker
      dynamoNamespace: llama3-70b-disagg-mn
      volumeMounts:
        - name: <pvc-claim-name>
          mountPoint: /opt/models
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        mainContainer:
          env:
            - name: SERVED_MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: MODEL_PATH
              value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
            - name: HF_HOME
              value: /opt/models
            - name: HF_TOKEN
              value: <your-hf-token>
          args:
          - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --disable-log-requests --is-prefill-worker --gpu-memory-utilization 0.95 --no-enable-prefix-caching --block-size 128"
          command:
          - /bin/sh
          - -c
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1
          workingDir: /workspace/examples/backends/vllm
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"
    VllmDecodeWorker:
      componentType: worker
      dynamoNamespace: llama3-70b-disagg-mn
      volumeMounts:
        - name: <pvc-claim-name>
          mountPoint: /opt/models
      sharedMemory:
        size: 80Gi
      extraPodSpec:
        mainContainer:
          env:
            - name: SERVED_MODEL_NAME
              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
            - name: MODEL_PATH
              value: "/opt/models/hub/models--RedHatAI--Llama-3.3-70B-Instruct-FP8-dynamic/snapshots/ddb4128556dfcff99e0c41aee159ea6c3e655dcd"
            - name: HF_HOME
              value: /opt/models
            - name: HF_TOKEN
              value: <your-hf-token>
          args:
          - "python3 -m dynamo.vllm --model $MODEL_PATH --served-model-name $SERVED_MODEL_NAME --tensor-parallel-size 8 --data-parallel-size 1 --disable-log-requests --gpu-memory-utilization 0.90 --no-enable-prefix-caching --block-size 128"
          command:
          - /bin/sh
          - -c
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.7.1
          workingDir: /workspace/examples/backends/vllm
      replicas: 1
      resources:
        limits:
          gpu: "8"
        requests:
          gpu: "8"

Run the the following command. Make sure to update the below with the name of your project:
```
runai workload submit \
  --file <name>.yaml \
  --project <project-name>
```

Step 5: Verifying the Workload Status

After submitting the workload, wait for it to reach the Running status in the Workloads table. A workload becomes Ready to accept inference requests only after all its pods have fully initialized, including model loading.

Large models may require several minutes to load their weights, especially when the model is stored on a PVC. During this time, the workload may remain in Initializing even though pods are already running.

To monitor progress, select the workload and click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the Logs tab to track model-download and model-loading progress. Select the relevant pod from the dropdown and review the pod logs.

Once the workload reaches Running and shows an available Connection, you can proceed to access the inference endpoint.

Step 6: Accessing the Inference Workload

You can programmatically consume an inference workload via API by making direct calls to the serving endpoint, typically from other workloads or external integrations. Once an inference workload is deployed, the serving endpoint URL appears in the Connections column of the Workloads table. To retrieve the service endpoint programmatically, use the Get Workloads API. The endpoint URL will be available in the response body under urls.

#replace <serving-endpoint-url> and <model-name> (e.g. "meta-llama/Llama-3.1-8B-Instruct")
curl <serving-endpoint-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [{
      "role": "user",
      "content": "Tell me about NVIDIA Dynamo"
    }]
  }'

Note

Configuring access to the inference endpoint is not supported for Dynamo-based deployments. As a result, the inference serving endpoint created in this tutorial is not protected.

Access control for inference endpoints (such as restricting access to authenticated users, service accounts, or groups) is planned for a future release.

PreviousNVIDIA Dynamo Aggregated Inference Deployment NextGeneral Settings

Last updated 1 day ago

Good afternoon

hashtagPrerequisites

hashtagCluster Prerequisites

hashtagNVIDIA Run:ai Prerequisites

hashtagStep 1: Logging In

hashtagStep 2: Creating a User Access Key

hashtagStep 3: Creating a PVC Data Source

hashtagStep 4: Creating the Workload

hashtagHow the Configuration Works

hashtagSubmitting the Workload

hashtagStep 5: Verifying the Workload Status

hashtagStep 6: Accessing the Inference Workload

Prerequisites

Cluster Prerequisites

NVIDIA Run:ai Prerequisites

Step 1: Logging In

Step 2: Creating a User Access Key

Step 3: Creating a PVC Data Source

Step 4: Creating the Workload

How the Configuration Works

Submitting the Workload

Step 5: Verifying the Workload Status

Step 6: Accessing the Inference Workload