Run Your First Custom Inference Workload

This quick start provides a step-by-step walkthrough for running and querying a custom inference workload.

An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 1 GPU.

  • Knative is properly installed by your administrator.

Note

Flexible workload submission is disabled by default. If unavailable, your administrator must enable it under General Settings → Workloads → Flexible workload submission.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Step 2: Submitting an Inference Workload

  1. Go to Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Inference

  3. Within the new form, select the cluster and project

  4. Enter a unique name for the workload. If the name already exists in the project, you will be prompted to enter a different name.

  5. Click CONTINUE

    In the next step:

  6. Under Environment, enter the Image URL - runai.jfrog.io/demo/example-triton-server

  7. Set the inference serving endpoint to HTTP and the container port to 8000

  8. Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘half-gpu’ compute resource for your workload.

    • If ‘half-gpu’ is not displayed, follow the steps below to create a one-time compute resource configuration:

      • Set GPU devices per pod - 1

      • Enable GPU fractioning to set the GPU memory per device

        • Select % (of device) - Fraction of a GPU device’s memory

        • Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

  9. Under Replica autoscaling:

    • Set a minimum of 0 replicas and a maximum of 2 replicas

    • Set the conditions for creating a new replica to Concurrency (Requests) and the value to 3

    • Set when the replicas should be automatically scaled down to zero to After 5 minutes of inactivity

  10. Click CREATE INFERENCE

This starts a Triton inference server with a maximum of 2 replicas, each consuming half a GPU.

Step 3: Querying the Inference Server

In this step, you'll test the deployed model by sending a request to the inference server. To do this, you'll launch a general-purpose workload, typically a Training or Workspace workload, to run the Triton demo client. You'll first retrieve the workload address, which serves as the model’s inference serving endpoint. Then, use the client to send a sample request and verify that the model is responding correctly.
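
The numbered steps below run this check from a workload inside the cluster. If the serving endpoint is reachable from your own machine, you can also do a quick verification directly, since Triton serves the standard KServe v2 HTTP health and metadata routes. The following is a minimal sketch, assuming a Python environment with the requests package installed; <INFERENCE-ENDPOINT> is a placeholder for the Address you copy in step 4:

    import requests

    # Placeholder: replace with the Address copied from the Connections pane (step 4).
    ADDRESS = "<INFERENCE-ENDPOINT>"

    # The first request may be slow if the workload has scaled down to zero replicas,
    # since a replica has to spin back up before it can answer.
    server_ready = requests.get(f"{ADDRESS}/v2/health/ready", timeout=60)
    print("Server ready:", server_ready.status_code == 200)

    # Check that the demo model itself is loaded and ready to serve requests.
    model_ready = requests.get(f"{ADDRESS}/v2/models/inception_graphdef/ready", timeout=60)
    print("Model ready:", model_ready.status_code == 200)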

  1. Go to Workload manager → Workloads.

  2. Click COLUMNS and select Connections.

  3. Select the link under the Connections column for the inference workload created in Step 2

  4. In the Connections Associated with Workload form, copy the URL under the Address column

  5. Click +NEW WORKLOAD and select Training

  6. Select the cluster and project where the inference workload was created

  7. Under Workload architecture, select Standard

  8. Select Start from scratch to launch a new workload quickly

  9. Enter a unique name for the workload. If the name already exists in the project, you will be prompted to enter a different name.

  10. Click CONTINUE

    In the next step:

  11. Under Environment, enter the Image URL - runai.jfrog.io/demo/example-triton-client

  12. Set the runtime settings for the environment. Click +COMMAND & ARGUMENTS and add the following:

    • Enter the command: perf_analyzer

    • Enter the arguments: -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>. Replace <INFERENCE-ENDPOINT> with the Address you copied in step 4. The -m flag names the model to test, and -p sets perf_analyzer's measurement window in milliseconds (3600000 ms, or one hour), so the client keeps sending requests long enough to generate sustained load. A lighter one-off check using Triton's Python client is sketched after this list.

  13. Click the load icon. A side pane appears, displaying a list of available compute resources. Select ‘cpu-only’ from the list.

    • If ‘cpu-only’ is not displayed, follow the steps below to create a one-time compute resource configuration:

      • Set GPU devices per pod - 0

      • Set CPU compute per pod - 0.1 cores

      • Set the CPU memory per pod - 100 MB (default)

  14. Click CREATE TRAINING
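
perf_analyzer keeps a continuous stream of inference requests going, which is what drives the autoscaling behavior and the metrics you will look at next. If you only want a quick one-off confirmation that the model answers, a lighter alternative is Triton's Python client library. The following is an illustrative sketch only, assuming the tritonclient[http] package is installed and that <INFERENCE-ENDPOINT-HOST> stands for the Address from step 4 with the http:// scheme removed, since the client expects a host[:port] value:

    import tritonclient.http as httpclient

    # Placeholder: the Address from step 4 without the "http://" prefix.
    client = httpclient.InferenceServerClient(url="<INFERENCE-ENDPOINT-HOST>")

    # Confirm the demo model is loaded, then print its input/output signature.
    print("Model ready:", client.is_model_ready("inception_graphdef"))
    print(client.get_model_metadata("inception_graphdef"))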

Next Steps

  • Select the inference workload you created in Step 2 and go to the Metrics tab to watch the GPU and inference metrics graphs rise as the client generates load.

  • Manage and monitor your newly created workload using the Workloads table.
