Run Your First Custom Inference Workload
This quick start provides a step-by-step walkthrough for running and querying a custom inference workload.
An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.
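For orientation only, the sketch below groups the settings configured throughout this guide into a single structure. The field names are hypothetical and purely illustrative; they are not the Run:ai API. The UI collects the same information in the steps that follow.

# Hypothetical, illustrative grouping of the settings configured in this guide.
# These field names are not the Run:ai API; the UI forms below collect the same values.
inference_workload = {
    "image": "runai.jfrog.io/demo/example-triton-server",          # container image serving the model
    "serving_endpoint": {"protocol": "HTTP", "container_port": 8000},
    "compute": {"gpu_devices_per_pod": 1, "gpu_memory_fraction": 0.5,
                "cpu_cores": 0.1, "cpu_memory_mb": 100},
    "autoscaling": {"min_replicas": 0, "max_replicas": 2,
                    "metric": "concurrency", "target": 3,
                    "scale_to_zero_after_minutes": 5},
}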
Prerequisites
Before you start, make sure:
You have created a project or have one created for you.
The project has an assigned quota of at least 1 GPU.
Knative is properly installed by your administrator.
Note
Flexible workload submission is disabled by default. If unavailable, your administrator must enable it under General Settings → Workloads → Flexible workload submission.
Step 1: Logging In
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
Step 2: Submitting an Inference Workload
Go to Workload manager → Workloads
Click +NEW WORKLOAD and select Inference
Within the new form, select the cluster and project
Enter a unique name for the workload. If the name already exists in the project, you will be prompted to enter a different one.
Click CONTINUE
In the next step:
Under Environment, enter the Image URL -
runai.jfrog.io/demo/example-triton-server
Set the inference serving endpoint to HTTP and the container port to
8000
Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘half-gpu’ compute resource for your workload.
If ‘half-gpu’ is not displayed, follow the steps below to create a one-time compute resource configuration:
Set GPU devices per pod - 1
Enable GPU fractioning to set the GPU memory per device
Select % (of device) - Fraction of a GPU device’s memory
Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Under Replica autoscaling:
Set a minimum of 0 replicas and maximum of 2 replicas
Set the conditions for creating a new replica to Concurrency (Requests) and the value to 3
Set the replicas to automatically scale down to zero After 5 minutes of inactivity (the sketch after this step shows how these autoscaling settings interact)
Click CREATE INFERENCE
This starts a Triton Inference Server with a maximum of 2 replicas, each consuming half a GPU.
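To reason about how the autoscaling values above interact, the minimal sketch below applies the usual concurrency-based rule of thumb: the desired replica count is the number of concurrent requests divided by the per-replica concurrency target, rounded up and clamped to the configured minimum and maximum. This is an approximation for intuition, not the exact scaling logic used by the platform.

import math

def desired_replicas(concurrent_requests, target=3, min_replicas=0, max_replicas=2):
    # Concurrency-based rule of thumb: one replica per `target` in-flight requests,
    # clamped to the configured minimum and maximum replica counts.
    if concurrent_requests <= 0:
        return min_replicas  # idle traffic eventually scales down to the minimum (0 here)
    needed = math.ceil(concurrent_requests / target)
    return max(min_replicas, min(max_replicas, needed))

# With the values used in this guide (target 3, min 0, max 2):
print(desired_replicas(0))   # 0 -> scales to zero after the inactivity window
print(desired_replicas(4))   # 2
print(desired_replicas(10))  # 2 -> capped at the 2-replica maximum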
Step 3: Querying the Inference Server
In this step, you'll test the deployed model by sending a request to the inference server. To do this, you'll launch a general-purpose workload, typically a Training or Workspace workload, to run the Triton demo client. You'll first retrieve the workload address, which serves as the model’s inference serving endpoint. Then, use the client to send a sample request and verify that the model is responding correctly.
Go to Workload manager → Workloads.
Click COLUMNS and select Connections.
Select the link under the Connections column for the inference workload created in Step 2
In the Connections Associated with Workload form, copy the URL under the Address column
Click +NEW WORKLOAD and select Training
Select the cluster and project where the inference workload was created
Under Workload architecture, select Standard
Select Start from scratch to launch a new workload quickly
Enter a unique name for the workload. If the name already exists in the project, you will be prompted to enter a different one.
Click CONTINUE
In the next step:
Under Environment, enter the Image URL -
runai.jfrog.io/demo/example-triton-client
Set the runtime settings for the environment. Click +COMMAND & ARGUMENTS and add the following:
Enter the command:
perf_analyzer
Enter the arguments:
-m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>
Make sure to replace <INFERENCE-ENDPOINT> with the Address you retrieved above (an alternative way to check the endpoint is sketched after this step).
Click the load icon. A side pane appears, displaying a list of available compute resources. Select ‘cpu-only’ from the list.
If ‘cpu-only’ is not displayed, follow the steps below to create a one-time compute resource configuration:
Set GPU devices per pod - 0
Set CPU compute per pod - 0.1 cores
Set the CPU memory per pod - 100 MB (default)
Click CREATE TRAINING
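If you prefer to verify the endpoint yourself instead of relying on perf_analyzer, the sketch below uses Python with the requests library against the standard KServe v2 REST endpoints that Triton exposes on its HTTP port (8000 in this guide). Replace the placeholder with the Address you copied from the Connections pane; depending on your cluster's network configuration, the address may only be reachable from inside the cluster.

import requests

# Replace with the Address copied from the Connections pane in Step 3.
INFERENCE_ENDPOINT = "http://<INFERENCE-ENDPOINT>"

# Triton implements the KServe v2 REST protocol on its HTTP port.
ready = requests.get(f"{INFERENCE_ENDPOINT}/v2/health/ready", timeout=10)
print("Server ready:", ready.status_code == 200)

# Fetch metadata for the model queried by perf_analyzer in this guide.
meta = requests.get(f"{INFERENCE_ENDPOINT}/v2/models/inception_graphdef", timeout=10)
print(meta.json())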
Next Steps
Manage and monitor your newly created workload using the Workloads table.