Launching workloads with GPU memory swap
This quick start provides a step-by-step walkthrough for running multiple LLMs (inference workloads) on a single GPU using GPU memory swap.
GPU memory swap extends the GPU's physical memory with CPU memory, allowing NVIDIA Run:ai to place and run more workloads on the same physical GPU. Workloads are context-switched smoothly between GPU memory and CPU memory, eliminating the need to kill workloads when their combined memory requirement exceeds what the GPU's physical memory can provide.
Note
If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either the Flexible or Original submission form. The steps in this quick start guide reflect the Original form only.
Prerequisites
Before you start, make sure:
You have created a project or have one created for you.
The project has an assigned quota of at least 1 GPU.
Dynamic GPU fractions is enabled.
GPU memory swap is enabled on at least one free node as detailed here.
Host-based routing is configured.
Note
Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.
Step 1: Logging in
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
Step 2: Submitting the first inference workload
Go to the Workload manager → Workloads
Click +NEW WORKLOAD and select Inference
Select under which cluster to create the workload
Select the project in which your workload will run
Select custom inference from Inference type
Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)
Click CONTINUE
In the next step:
Create an environment for your workload
Click +NEW ENVIRONMENT
Enter a name for the environment. The name must be unique.
Enter the NVIDIA Run:ai vLLM Image URL -
runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0
Set the runtime settings for the environment
Click +ENVIRONMENT VARIABLE and add the following
Name: RUNAI_MODEL Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct (you can choose any vLLM-supported model from Hugging Face)
Name: RUNAI_MODEL_NAME Source: Custom Value: Llama-3.2-1B-Instruct
Name: HF_TOKEN Source: Custom Value: <Your Hugging Face token> (only needed for gated models)
Name: VLLM_RPC_TIMEOUT Source: Custom Value: 60000
Click CREATE ENVIRONMENT
The newly created environment will be selected automatically
Create a new “request-limit” compute resource
Click +NEW COMPUTE RESOURCE
Enter a name for the compute resource. The name must be unique.
Set GPU devices per pod - 1
Set GPU memory per device
Select % (of device) - Fraction of a GPU device’s memory
Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)
Toggle Limit and set it to 100% (with GPU memory swap, each workload can use the full GPU memory while it is active, even though two such workloads share the same GPU)
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Select More settings and toggle Increase shared memory size
Click CREATE COMPUTE RESOURCE
The newly created compute resource will be selected automatically
Click CREATE INFERENCE
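Optionally, once the workload is running, you can sanity-check its endpoint from your machine. The sketch below is a minimal example, assuming the workload exposes vLLM's OpenAI-compatible API at the address shown in the workload's Connections column (copying that address is shown in Step 4); the base URL is a placeholder, and the model name must match the RUNAI_MODEL_NAME value set above.

```python
# Minimal sanity check for the first inference workload (see assumptions above).
# Requires: pip install requests
import requests

BASE_URL = "https://<address-from-connections-column>"  # placeholder: copy from the Connections column (Step 4)
MODEL_NAME = "Llama-3.2-1B-Instruct"                     # must match the RUNAI_MODEL_NAME environment variable

resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```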
Step 3: Submitting the second inference workload
Go to the Workload manager → Workloads
Click +NEW WORKLOAD and select Inference
Select the cluster where the previous inference workload was created
Select the project where the previous inference workload was created
Select custom inference from Inference type
Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)
Click CONTINUE
In the next step:
Select the environment created in Step 2
Select the compute resource created in Step 2
Click CREATE INFERENCE
Step 4: Submitting the first workspace
Go to the Workload manager → Workloads
Click COLUMNS and select Connections
Select the link under the Connections column for the first inference workload created in Step 2
In the Connections Associated with Workload form, copy the URL under the Address column
Click +NEW WORKLOAD and select Workspace
Select the cluster where the previous inference workloads were created
Select the project where the previous inference workloads were created
Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)
Click CONTINUE
In the next step:
Select the ‘chatbot-ui’ environment for your workspace (Image URL: runai.jfrog.io/core-llm/llm-app)
Set the runtime settings for the environment with the following environment variables:
Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct
Name: RUNAI_MODEL_BASE_URL Source: Custom Value: the address copied from the Connections column at the beginning of this step (Step 4)
Delete the PATH_PREFIX environment variable if you are using host-based routing.
If ‘chatbot-ui’ is not displayed in the gallery, follow the steps below:
Click +NEW ENVIRONMENT
Enter a name for the environment. The name must be unique.
Enter the chatbot-ui Image URL -
runai.jfrog.io/core-llm/llm-app
Tools - Set the connection for your tool
Click +TOOL
Select Chatbot UI tool from the list
Set the runtime settings for the environment
Click +ENVIRONMENT VARIABLE
Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct
Name: RUNAI_MODEL_BASE_URL Source: Custom Value: the address copied from the Connections column at the beginning of this step (Step 4)
Name: RUNAI_MODEL_TOKEN_LIMIT Source: Custom Value: 8192
Name: RUNAI_MODEL_MAX_LENGTH Source: Custom Value: 16384
Click CREATE ENVIRONMENT
The newly created environment will be selected automatically
Select the ‘cpu-only’ compute resource for your workspace
If ‘cpu-only’ is not displayed in the gallery, follow the steps below:
Click +NEW COMPUTE RESOURCE
Enter a name for the compute resource. The name must be unique.
Set GPU devices per pod - 0
Set CPU compute per pod - 0.1 cores
Set the CPU memory per pod - 100 MB (default)
Click CREATE COMPUTE RESOURCE
The newly created compute resource will be selected automatically
Click CREATE WORKSPACE
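If the chatbot does not respond once the workspace is up, a quick way to confirm that the RUNAI_MODEL_BASE_URL value is correct is to list the models served by the inference workload. The sketch below is a minimal example, assuming the inference workload exposes vLLM's OpenAI-compatible API at that address; the URL is a placeholder.

```python
# List served models to confirm RUNAI_MODEL_BASE_URL points at the inference workload.
# Requires: pip install requests
import requests

BASE_URL = "https://<address-from-connections-column>"  # placeholder: the value used for RUNAI_MODEL_BASE_URL

resp = requests.get(f"{BASE_URL}/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # expect to see Llama-3.2-1B-Instruct
```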
Step 5: Submitting the second workspace
Go to the Workload manager → Workloads
Click COLUMNS and select Connections
Select the link under the Connections column for the second inference workload created in Step 3
In the Connections Associated with Workload form, copy the URL under the Address column
Click +NEW WORKLOAD and select Workspace
Select the cluster where the previous inference workloads were created
Select the project where the previous inference workloads were created
Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)
Click CONTINUE
In the next step:
Select the ‘chatbot-ui’ environment created in Step 4
Set the runtime settings for the environment with the following environment variables:
Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct
Name: RUNAI_MODEL_BASE_URL Source: Custom Value: the address copied from the Connections column at the beginning of this step (Step 5)
Delete the PATH_PREFIX environment variable if you are using host-based routing.
Select the ‘cpu-only’ compute resource created in Step 4
Click CREATE WORKSPACE
Step 6: Connecting to Chatbot-UI
Select the newly created workspace that you want to connect to
Click CONNECT
Select the ChatbotUI tool. The selected tool opens in a new browser tab.
Query both chatbots simultaneously and watch them both respond. The model whose memory is currently swapped out to CPU RAM takes longer to answer, since it must first be swapped back into GPU memory, and vice versa.
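If you prefer to observe the swap behavior outside the chatbot UI, the sketch below sends a prompt to both inference endpoints at the same time and prints how long each takes to respond. It is a minimal example, assuming both workloads expose vLLM's OpenAI-compatible API; the two addresses are placeholders for the URLs copied from the Connections column in Steps 4 and 5.

```python
# Query both inference workloads concurrently and compare response times.
# The workload whose memory is currently swapped out to CPU RAM responds more slowly
# while it is swapped back into GPU memory.
# Requires: pip install requests
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINTS = {
    "first-workload": "https://<address-of-first-inference-workload>",    # placeholders: copy from the
    "second-workload": "https://<address-of-second-inference-workload>",  # Connections column
}
MODEL_NAME = "Llama-3.2-1B-Instruct"  # matches RUNAI_MODEL_NAME in both workloads

def ask(name_and_url):
    name, url = name_and_url
    start = time.time()
    resp = requests.post(
        f"{url}/v1/chat/completions",
        json={
            "model": MODEL_NAME,
            "messages": [{"role": "user", "content": "Summarize GPU memory swap in one sentence."}],
            "max_tokens": 64,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return name, time.time() - start

with ThreadPoolExecutor(max_workers=2) as pool:
    for name, seconds in pool.map(ask, ENDPOINTS.items()):
        print(f"{name}: responded in {seconds:.1f}s")
```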
Next steps
Manage and monitor your newly created workloads using the Workloads table.