Run Your First Custom Inference Workload

This quick start provides a step-by-step walkthrough for running and querying a custom inference workload.

An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 1 GPU.

  • Knative is properly installed by your administrator.
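
If you have kubectl access to the cluster, a quick way to confirm the Knative prerequisite is to check that the Knative Serving pods are healthy. This is only a sketch and assumes Knative Serving was installed into its default knative-serving namespace:

    kubectl get pods -n knative-serving   # all pods should be Running or Completed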

Note

  • Flexible workload submission is disabled by default. If it is unavailable, your administrator must enable it under General settings → Workloads → Flexible workload submission.

  • The Custom inference type is displayed as a selectable option only if your administrator has enabled it under General settings → Workloads → Models. If it is not enabled, Custom is used as the default inference type and is not shown as a selectable option.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Step 2: Submitting an Inference Workload

  1. Go to Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Inference

  3. Select the cluster in which to create the workload

  4. Select the project in which your workload will run

  5. Under Inference type, select Custom (if applicable)

  6. Enter a unique name for the workload. If the name already exists in the project, you will be prompted to enter a different name.

  7. Click CONTINUE

    In the next step:

  8. Under Environment, enter the Image URL - runai.jfrog.io/demo/example-triton-server

  9. Set the inference serving endpoint to HTTP and the container port to 8000

  10. Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘half-gpu’ compute resource for your workload.

    • If ‘half-gpu’ is not displayed, follow the steps below to create a one-time compute resource configuration:

      • Set GPU devices per pod - 1

      • Enable GPU fractioning to set the GPU memory per device

        • Select % (of device) - Fraction of a GPU device’s memory

        • Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

  11. Under Replica autoscaling:

    • Set a minimum of 0 replicas and a maximum of 2 replicas

    • Set the conditions for creating a new replica to Concurrency (Requests) and the value to 3

    • Set when the replicas should be automatically scaled down to zero to After 5 minutes of inactivity

  12. Click CREATE INFERENCE

This starts a Triton Inference Server with a maximum of 2 replicas, each consuming half a GPU. Because the minimum is 0 replicas, the server scales down to zero after 5 minutes of inactivity, so the first request after an idle period may take longer while a new replica starts.
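
If you also have kubectl access, you can see the Knative Service that backs this inference workload, including its serving URL. This is only a sketch and assumes your project namespace follows Run:ai's usual runai-<project-name> convention; adjust the namespace if your cluster differs:

    # Lists Knative Services in the project namespace, including the workload's URL
    kubectl get ksvc -n runai-<project-name>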

Step 3: Querying the Inference Server

In this step, you'll test the deployed model by sending a request to the inference server. To do this, you'll launch a general-purpose workload, typically a Training or Workspace workload, to run the Triton demo client. You'll first retrieve the workload address, which serves as the model’s inference serving endpoint. Then, use the client to send a sample request and verify that the model is responding correctly.
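
Before running the full client, you can optionally sanity-check the endpoint from any machine that can reach the address (retrieved in step 4 below). This is only a sketch and assumes the demo image exposes Triton's standard HTTP/REST API (the KServe v2 protocol) behind the HTTP port configured in Step 2; replace <INFERENCE-ENDPOINT> with the copied Address:

    # Returns HTTP status 200 once the Triton server is ready to serve requests
    curl -s -o /dev/null -w "%{http_code}\n" <INFERENCE-ENDPOINT>/v2/health/ready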

  1. Go to Workload manager → Workloads

  2. Click COLUMNS and select Connections

  3. Select the link under the Connections column for the inference workload created in Step 2

  4. In the Connections Associated with Workload form, copy the URL under the Address column

  5. Click +NEW WORKLOAD and select Training

  6. Select the cluster and project where the inference workload was created

  7. Under Workload architecture, select Standard

  8. Select Start from scratch to launch a new workload quickly

  9. Enter a unique name for the workload. If the name already exists in the project, you will be prompted to enter a different name.

  10. Click CONTINUE

    In the next step:

  11. Under Environment, enter the Image URL - runai.jfrog.io/demo/example-triton-client

  12. Set the runtime settings for the environment. Click +COMMAND & ARGUMENTS and add the following:

    • Enter the command - perf_analyzer

    • Enter the arguments - -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>. Make sure to replace <INFERENCE-ENDPOINT> with the Address you copied in step 4 (the assembled command is shown after this list).

  13. Click the load icon. A side pane appears, displaying a list of available compute resources. Select ‘cpu-only’ from the list.

    • If ‘cpu-only’ is not displayed, follow the steps below to create a one-time compute resource configuration:

      • Set GPU devices per pod - 0

      • Set CPU compute per pod - 0.1 cores

      • Set the CPU memory per pod - 100 MB (default)

  14. Click CREATE TRAINING
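
For reference, once the placeholder is replaced with the Address from step 4, the command the client container runs looks like the line below. Per Triton's perf_analyzer documentation, -m is the model name, -p the measurement interval in milliseconds, and -u the serving endpoint:

    perf_analyzer -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>

With a 3,600,000 ms (one hour) measurement interval, the client keeps generating requests against the inference workload for an extended period, which is what drives the metrics described under Next Steps.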

Next Steps

  • Select the inference workload you created in Step 2 and go to the Metrics tab to watch the GPU and inference metrics graphs climb as the client generates load.

  • Manage and monitor your newly created workload using the Workloads table.
