Launching Workloads with GPU Memory Swap

This quick start provides a step-by-step walkthrough for running multiple LLMs (inference workloads) on a single GPU using GPU memory swap.

GPU memory swap extends the GPU's physical memory with CPU memory, allowing NVIDIA Run:ai to place and run more workloads on the same physical GPU. Workloads are context-switched smoothly between GPU memory and CPU memory, eliminating the need to kill workloads when their combined memory requirement exceeds the GPU's physical memory. In this walkthrough, for example, two inference workloads each request 50% of the GPU's memory but are allowed to burst up to 100%; whichever is idle at a given moment can be swapped out to CPU RAM.

Prerequisites

Before you start, make sure the following requirements are met:

Note

  • Flexible workload submission is disabled by default. If unavailable, your administrator must enable it under General Settings → Workloads → Flexible workload submission.

  • The Custom inference type appears only if your administrator has enabled it under General Settings → Workloads → Models. If it is not enabled, Custom is used as the default inference type and is not displayed as a selectable option.

  • Dynamic GPU fractions are disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, they must be enabled by your administrator under General Settings → Resources → GPU resource optimization.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Step 2: Submitting the First Inference Workload

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Inference

  3. Select under which cluster to create the workload

  4. Select the project in which your workload will run

  5. Select Custom as the Inference type (if applicable)

  6. Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Click the load icon. A side pane appears, displaying a list of available environments. To add a new environment:

    • Click the + icon to create a new environment

    • Enter quick-start as the name for the environment. The name must be unique.

    • Enter the NVIDIA Run:ai vLLM Image URL - runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0

    • Set the inference serving endpoint to HTTP and the container port to 8000 (see the sketch after these steps for a way to query the resulting endpoint)

    • Set the runtime settings for the environment. Click +ENVIRONMENT VARIABLE and add the following:

      • Name: RUNAI_MODEL Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct (you can choose any vLLM-supported model from Hugging Face)

      • Name: RUNAI_MODEL_NAME Source: Custom Value: Llama-3.2-1B-Instruct

      • Name: HF_TOKEN Source: Custom Value: <Your Hugging Face token> (only needed for gated models)

      • Name: VLLM_RPC_TIMEOUT Source: Custom Value: 60000

    • Click CREATE ENVIRONMENT

    • Select the newly created environment from the side pane

  9. Click the load icon. A side pane appears, displaying a list of available compute resources. To add a new compute resource:

    • Click the + icon to create a new compute resource

    • Enter request-limit as the name for the compute resource. The name must be unique.

    • Set GPU devices per pod - 1

    • Enable GPU fractioning to set the GPU memory per device:

      • Select % (of device) - Fraction of a GPU device’s memory

      • Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)

      • Set the memory Limit - 100%

    • Optional: set the CPU compute per pod - 0.1 cores (default)

    • Optional: set the CPU memory per pod - 100 MB (default)

    • Select More settings and toggle Increase shared memory size

    • Click CREATE COMPUTE RESOURCE

    • Select the newly created compute resource from the side pane

  10. Click CREATE INFERENCE
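Once the first workload is running, you can query it from outside the UI. Below is a minimal sketch, assuming the Run:ai vLLM image exposes vLLM's standard OpenAI-compatible API on the configured container port and that the endpoint requires no extra authentication headers; the base URL is a placeholder for the address shown under the workload's Connections column (Step 4 shows where to find it), and the model name matches the RUNAI_MODEL_NAME value set above.

```python
# Minimal sketch: query the inference workload's OpenAI-compatible endpoint.
import requests

# Placeholder -- replace with the address from the workload's Connections column.
BASE_URL = "https://<address-from-connections-column>"

payload = {
    "model": "Llama-3.2-1B-Instruct",  # must match RUNAI_MODEL_NAME
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

# vLLM serves the OpenAI-compatible API under /v1.
resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```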

Step 3: Submitting the Second Inference Workload

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Inference

  3. Select the cluster where the previous inference workload was created

  4. Select the project where the previous inference workload was created

  5. Select Custom as the Inference type (if applicable)

  6. Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Click the load icon. A side pane appears, displaying a list of available environments. Select the environment created in Step 2.

  9. Click the load icon. A side pane appears, displaying a list of available compute resources. Select the compute resource created in Step 2.

  10. Click CREATE INFERENCE

Step 4: Submitting the First Workspace

  1. Go to the Workload manager → Workloads

  2. Click COLUMNS and select Connections

  3. Select the link under the Connections column for the first inference workload created in Step 2

  4. In the Connections Associated with Workload form, copy the URL under the Address column

  5. Click +NEW WORKLOAD and select Workspace

  6. Select the cluster where the previous inference workloads were created

  7. Select the project where the previous inference workloads were created

  8. Select Start from scratch to launch a new workspace quickly

  9. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  10. Click CONTINUE

    In the next step:

  11. Click the load icon. A side pane appears, displaying a list of available environments. Select the ‘chatbot-ui’ environment for your workspace (Image URL: runai.jfrog.io/core-llm/llm-app)

    • Set the runtime settings for the environment with the following environment variables:

      • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

      • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Paste the address copied in instruction 4 (see the sketch after these steps for a quick way to verify it)

      • Delete the PATH_PREFIX environment variable if you are using host-based routing.

    • If ‘chatbot-ui’ is not displayed in the gallery, follow the steps below:

      • Click the + icon to create a new environment

      • Enter chatbot-ui as the name for the environment. The name must be unique.

      • Enter the chatbot-ui Image URL - runai.jfrog.io/core-llm/llm-app

      • Tools - Set the connection for your tool

        • Click +TOOL

        • Select Chatbot UI tool from the list

      • Set the runtime settings for the environment. Click +ENVIRONMENT VARIABLE and add the following:

        • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

        • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Paste the address copied in instruction 4

        • Name: RUNAI_MODEL_TOKEN_LIMIT Source: Custom Value: 8192

        • Name: RUNAI_MODEL_MAX_LENGTH Source: Custom Value: 16384

      • Click CREATE ENVIRONMENT

      • Select the newly created environment from the side pane

  12. Click the load icon. A side pane appears, displaying a list of available compute resources. Select ‘cpu-only’ from the list.

    • If ‘cpu-only’ is not displayed, follow the steps below:

      • Click the + icon to create a new compute resource

      • Enter cpu-only as the name for the compute resource. The name must be unique.

      • Set GPU devices per pod - 0

      • Set CPU compute per pod - 0.1 cores

      • Set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

      • Select the newly created compute resource from the side pane

  13. Click CREATE WORKSPACE
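Before or after wiring the chatbot to the inference endpoint, you can sanity-check the address copied in instruction 4. This is a minimal sketch, assuming the inference workload exposes vLLM's standard OpenAI-compatible /v1/models route; the address is a placeholder:

```python
# Minimal sketch: verify the inference endpoint lists the expected model.
import requests

# Placeholder -- replace with the address copied from the Connections column.
ADDRESS = "https://<address-from-connections-column>"

resp = requests.get(f"{ADDRESS}/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should include Llama-3.2-1B-Instruct
```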

Step 5: Submitting the Second Workspace

  1. Go to the Workload manager → Workloads

  2. Click COLUMNS and select Connections

  3. Select the link under the Connections column for the second inference workload created in Step 3

  4. In the Connections Associated with Workload form, copy the URL under the Address column

  5. Click +NEW WORKLOAD and select Workspace

  6. Select the cluster where the previous inference workloads were created

  7. Select the project where the previous inference workloads were created

  8. Select Start from scratch to launch a new workspace quickly

  9. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  10. Click CONTINUE

    In the next step:

  11. Click the load icon. A side pane appears, displaying a list of available environments. Select the environment created in Step 4.

    • Set the runtime settings for the environment with the following environment variables:

      • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

      • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Paste the address copied in instruction 4

      • Delete the PATH_PREFIX environment variable if you are using host-based routing.

  12. Click the load icon. A side pane appears, displaying a list of available compute resources. Select the compute resource used in Step 4.

  13. Click CREATE WORKSPACE

Step 6: Connecting to Chatbot-UI

  1. Select the newly created workspace that you want to connect to

  2. Click CONNECT

  3. Select the ChatbotUI tool. The selected tool opens in a new browser tab.

  4. Query both chatbots simultaneously and watch them both respond. The model that is currently swapped out to CPU RAM takes longer to answer while it is swapped back into GPU memory, and vice versa (see the sketch below).
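To reproduce this check from a script rather than the browser, here is a minimal sketch that fires one request at each inference endpoint concurrently and times the responses; both URLs are placeholders for the addresses in each workload's Connections column, and the model name matches the RUNAI_MODEL_NAME value from Step 2.

```python
# Minimal sketch: query both endpoints at once and time the responses.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholders -- replace with the addresses from each workload's Connections column.
ENDPOINTS = [
    "https://<address-of-first-inference-workload>",
    "https://<address-of-second-inference-workload>",
]

def ask(base_url: str) -> str:
    payload = {
        "model": "Llama-3.2-1B-Instruct",  # must match RUNAI_MODEL_NAME
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    }
    start = time.perf_counter()
    resp = requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=300)
    resp.raise_for_status()
    return f"{base_url}: {time.perf_counter() - start:.1f}s"

# Fire both requests simultaneously; the model currently swapped out to CPU RAM
# should take noticeably longer while it is swapped back onto the GPU.
with ThreadPoolExecutor(max_workers=2) as pool:
    for line in pool.map(ask, ENDPOINTS):
        print(line)
```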

Next Steps

Manage and monitor your newly created workloads using the Workloads table.
