Run Your First Custom Inference Workload
This quick start provides a step-by-step walkthrough for running and querying a custom inference workload.
An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.
Prerequisites
Before you start, make sure:
You have created a project or have one created for you.
The project has an assigned quota of at least 1 GPU.
Knative is properly installed by your administrator.
Step 1: Logging In
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
Run the following --help command to see the available login options, then log in according to your setup:
runai login --help
To use the API, you will need to obtain a token as shown in API authentication.
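If you plan to use the API, the token is typically obtained by exchanging an application's client credentials. The sketch below is an assumption based on recent versions; the exact endpoint and field names may differ, so follow API authentication for your setup:
# Hypothetical example - exchange application credentials for an access token
# <COMPANY-URL>, <APP-ID> and <APP-SECRET> are placeholders for your environment
curl -L 'https://<COMPANY-URL>/api/v1/token' \
-H 'Content-Type: application/json' \
-d '{
"grantType": "app_token",
"AppId": "<APP-ID>",
"AppSecret": "<APP-SECRET>"
}'
# The accessToken field in the response is the <TOKEN> used in the API calls below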
Step 2: Submitting an Inference Workload
Go to Workload manager → Workloads
Click +NEW WORKLOAD and select Inference
Select the cluster under which to create the workload
Select the project in which your workload will run
Select custom inference from Inference type (if applicable)
Enter a unique name for the workload. If the name already exists in the project, you will be requested to submit a different name.
Click CONTINUE
In the next step:
Under Environment, enter the Image URL - runai.jfrog.io/demo/example-triton-server
Set the inference serving endpoint to HTTP and the container port to 8000
Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘half-gpu’ compute resource for your workload.
If ‘half-gpu’ is not displayed, follow the steps below to create a one-time compute resource configuration:
Set GPU devices per pod - 1
Enable GPU fractioning to set the GPU memory per device
Select % (of device) - Fraction of a GPU device’s memory
Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Under Replica autoscaling:
Set a minimum of 0 replicas and maximum of 2 replicas
Set the conditions for creating a new replica to Concurrency (Requests) and the value to 3
Set when the replicas should be automatically scaled down to zero to After 5 minutes of inactivity
Click CREATE INFERENCE
This starts a Triton inference server with a maximum of 2 instances, each consuming half a GPU.
Alternatively, you can submit the workload by creating reusable environment and compute resource assets instead of one-time configurations:
Go to Workload manager → Workloads.
Click +NEW WORKLOAD and select Inference
Select the cluster under which to create the workload
Select the project in which your workload will run
Select custom inference from Inference type (if applicable)
Enter a unique name for the workload. If the name already exists in the project, you will be requested to submit a different name.
Click CONTINUE
In the next step:
Create an environment for your workload
Click +NEW ENVIRONMENT
Enter a name for the environment. The name must be unique.
Enter the Image URL - runai.jfrog.io/demo/example-triton-server
Set the inference serving endpoint to HTTP and the container port to 8000
Click CREATE ENVIRONMENT
The newly created environment will be selected automatically
Select the ‘half-gpu’ compute resource for your workload
If ‘half-gpu’ is not displayed in the gallery, follow the steps below:
Click +NEW COMPUTE RESOURCE
Enter a name for the compute resource. The name must be unique.
Set GPU devices per pod - 1
Enable GPU fractioning to set the GPU memory per device
Select % (of device) - Fraction of a GPU device’s memory
Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE COMPUTE RESOURCE
The newly created compute resource will be selected automatically
Under Replica autoscaling:
Set a minimum of 0 replicas and maximum of 2 replicas
Set the conditions for creating a new replica to Concurrency (Requests) and set the value to 3
Set when the replicas should be automatically scaled down to zero to After 5 minutes of inactivity
Click CREATE INFERENCE
This starts a Triton inference server with a maximum of 2 instances, each consuming half a GPU.
To submit the workload using the CLI, copy the following commands to your terminal. Make sure to replace <project_name> and <name> with your project and workload names. For more details, see CLI reference:
runai project set <project_name>
runai inference submit <name> -i runai.jfrog.io/demo/example-triton-server \
--gpu-devices-request 1 --gpu-portion-request 0.5 \
--serving-port=8000 --min-replicas=0 --max-replicas=2 \
--metric=concurrency --metric-threshold 3 \
--scale-to-zero-retention-seconds 300
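After submitting, you can optionally check the workload's status from the same terminal using the runai inference describe command (the same command referenced in Step 3 for retrieving the inference endpoint). This is a usage sketch; the exact output depends on your CLI version:
runai inference describe <name>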
To submit the workload using the API, copy the following command to your terminal. Make sure to update the parameters below. For more details, see Inferences API:
curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"name": "workload-name",
"useGivenNameAsPrefix": true,
"projectId": "<PROJECT-ID>",
"clusterId": "<CLUSTER-UUID>",
"spec": {
"image": "runai.jfrog.io/demo/example-triton-server",
"servingPort": {
"protocol": "http",
"container": 8000
},
"autoscaling": {
"minReplicas": 0,
"maxReplicas": 2,
"metric": "concurrency",
"metricThreshold": 3,
"scaleToZeroRetentionSeconds": 300
},
"compute": {
"cpuCoreRequest": 0.1,
"gpuRequestType": "portion",
"cpuMemoryRequest": "100M",
"gpuDevicesRequest": 1,
"gpuPortionRequest": 0.5
}
}
}'
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface
<TOKEN> - The API access token obtained in Step 1
<PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
<CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.
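If you need to look up the project ID and cluster UUID from the API, they can typically be retrieved as shown below. The paths are assumptions based on recent versions of the Get Clusters and Get Projects APIs; check the API reference for your version:
# Assumed path - list clusters; the id field of your cluster is <CLUSTER-UUID>
curl -L 'https://<COMPANY-URL>/api/v1/clusters' \
-H 'Authorization: Bearer <TOKEN>'
# Assumed path - list projects; the id field of your project is <PROJECT-ID>
curl -L 'https://<COMPANY-URL>/api/v1/org-unit/projects' \
-H 'Authorization: Bearer <TOKEN>'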
Step 3: Querying the Inference Server
In this step, you'll test the deployed model by sending a request to the inference server. To do this, you'll launch a general-purpose workload, typically a Training or Workspace workload, to run the Triton demo client. You'll first retrieve the workload address, which serves as the model’s inference serving endpoint. Then, use the client to send a sample request and verify that the model is responding correctly.
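Optionally, before launching the client workload, you can confirm that the server responds by calling Triton's standard HTTP endpoints directly. This is a minimal sketch that assumes the Address you copy in the steps below is reachable from wherever you run it; with scale-to-zero enabled, the first request may take a moment while a replica spins up:
# <INFERENCE-ENDPOINT> is the Address copied from the Connections pane
curl -s -o /dev/null -w "%{http_code}\n" <INFERENCE-ENDPOINT>/v2/health/ready
# Returns HTTP 200 once the server is ready; model metadata for the demo model:
curl -s <INFERENCE-ENDPOINT>/v2/models/inception_graphdef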
Go to the Workload manager → Workloads.
Click COLUMNS and select Connections.
Select the link under the Connections column for the inference workload created in Step 2
In the Connections Associated with Workload form, copy the URL under the Address column
Click +NEW WORKLOAD and select Training
Select the cluster and project where the inference workload was created
Under Workload architecture, select Standard
Select Start from scratch to launch a new workload quickly
Enter a unique name for the workload. If the name already exists in the project, you will be requested to submit a different name.
Click CONTINUE
In the next step:
Under Environment, enter the Image URL - runai.jfrog.io/demo/example-triton-client
Set the runtime settings for the environment. Click +COMMAND & ARGUMENTS and add the following:
Enter the command - perf_analyzer
Enter the arguments - -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>. Make sure to replace the inference endpoint with the Address you retrieved above.
Click the load icon. A side pane appears, displaying a list of available compute resources. Select ‘cpu-only’ from the list.
If ‘cpu-only’ is not displayed, follow the steps below to create a one-time compute resource configuration:
Set GPU devices per pod - 0
Set CPU compute per pod - 0.1 cores
Set the CPU memory per pod - 100 MB (default)
Click CREATE TRAINING
Alternatively, you can submit the client workload by creating reusable environment and compute resource assets:
Go to the Workload manager → Workloads.
Click COLUMNS and select Connections.
Select the link under the Connections column for the inference workload created in Step 2
In the Connections Associated with Workload form, copy the URL under the Address column
Click +NEW WORKLOAD and select Training
Select the cluster and project where the inference workload was created
Under Workload architecture, select Standard
Select Start from scratch to launch a new workload quickly
Enter a unique name for the workload. If the name already exists in the project, you will be requested to submit a different name.
Click CONTINUE
In the next step:
Create an environment for your workload
Click +NEW ENVIRONMENT
Enter quick-start as the name for the environment. The name must be unique.
Enter the Image URL - runai.jfrog.io/demo/example-triton-client
Set the runtime settings for the environment. Click +COMMAND & ARGUMENTS and add the following:
Enter the command: perf_analyzer
Enter the arguments: -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>. Make sure to replace the inference endpoint with the Address you retrieved above.
Click CREATE ENVIRONMENT
The newly created environment will be selected automatically
Select the ‘cpu-only’ compute resource for your workload
If ‘cpu-only’ is not displayed in the gallery, follow the steps below:
Click +NEW COMPUTE RESOURCE
Enter cpu-only as the name for the compute resource. The name must be unique.
Set GPU devices per pod - 0
Set CPU compute per pod - 0.1 cores
Set the CPU memory per pod - 100 MB (default)
Click CREATE COMPUTE RESOURCE
The newly created compute resource will be selected automatically
Click CREATE TRAINING
To submit the client workload using the CLI, copy the following commands to your terminal. Make sure to replace <project_name> and <name> with your project and workload names. To retrieve the inference endpoint, use the runai inference describe command. For more details, see CLI reference:
runai project set <project_name>
runai training submit <name> -i runai.jfrog.io/demo/example-triton-client \
-- perf_analyzer -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>
To submit the client workload using the API, copy the following command to your terminal. Make sure to update the parameters below. For more details, see Trainings API:
curl -L 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"name": "workload-name",
"projectId": "<PROJECT-ID>",
"clusterId": "<CLUSTER-UUID>",
"spec": {
"image": "runai.jfrog.io/demo/example-triton-client",
"command": "perf_analyzer",
"args": "-m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>",
"compute": {
"cpuCoreRequest":0.1,
"cpuMemoryRequest": "100M",
}
}
}'
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface
<TOKEN> - The API access token obtained in Step 1
<PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
<CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.
<INFERENCE-ENDPOINT> - You can get the inference endpoint from the urls parameter via the Get Workloads API.
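To retrieve the inference endpoint programmatically, it is listed under the urls parameter of the corresponding workload returned by the Get Workloads API. A minimal sketch, assuming you locate your inference workload by name in the response:
# List workloads and read the urls field of your inference workload
curl -L 'https://<COMPANY-URL>/api/v1/workloads' \
-H 'Authorization: Bearer <TOKEN>'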
Next Steps
Manage and monitor your newly created workload using the Workloads table.