# Run Your First Custom Inference Workload

This quick start provides a step-by-step walkthrough for running and querying a custom inference workload.

An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.

## Prerequisites

Before you start, make sure:

* You have created a [project](https://run-ai-docs.nvidia.com/self-hosted/2.20/platform-management/aiinitiatives/organization/projects) or have one created for you.
* The project has an assigned quota of at least 1 GPU.
* [Knative](https://run-ai-docs.nvidia.com/self-hosted/2.20/getting-started/installation/install-using-helm/system-requirements#inference) is properly installed by your administrator.
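
If you have `kubectl` access to the cluster, you can optionally verify the Knative installation before you start. A minimal check, assuming a standard install in the default `knative-serving` namespace:

```bash
# List the Knative Serving control-plane pods; all of them should be Running
# before you submit an inference workload (assumes the default namespace).
kubectl get pods -n knative-serving
```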

{% hint style="info" %}
**Note**

The Inference workload type is disabled by default. If you cannot see it in the menu, ask your administrator to enable it under **General settings** → **Workloads** → **Models**.
{% endhint %}

## Step 1: Logging In

{% tabs %}
{% tab title="UI" %}
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
{% endtab %}

{% tab title="API" %}
To use the API, you will need to obtain a token as shown in [API authentication](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/getting-started/how-to-authenticate-to-the-api).
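
For example, a token request for an application typically looks like the sketch below. The endpoint path and body fields are assumptions based on the application-token flow; `<APP-ID>` and `<APP-SECRET>` are the credentials of an application created for you:

```bash
# Request an API access token with application credentials (a sketch; the exact
# flow for your environment is described in the API authentication guide).
curl -L 'https://<COMPANY-URL>/api/v1/token' \
-H 'Content-Type: application/json' \
-d '{
    "grantType": "app_token",
    "AppId": "<APP-ID>",
    "AppSecret": "<APP-SECRET>"
}'
```

The access token returned in the response is used as `<TOKEN>` in the following steps.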
{% endtab %}
{% endtabs %}

## Step 2: Submitting an Inference Workload

{% tabs %}
{% tab title="UI" %}

1. Go to **Workload manager** → **Workloads**.
2. Click **+NEW WORKLOAD** and select **Inference**.
3. Select the **cluster** in which to create the workload.
4. Select the **project** in which your workload will run.
5. Select **Custom** from **Inference type** (if applicable).
6. Enter a unique **name** for the workload. If the name already exists in the project, you will be requested to submit a different name.
7. Under **Submission**, select **Original** and click **CONTINUE**.
8. Create an environment for your workload:

   * Click **+NEW ENVIRONMENT**
   * Enter a **name** for the environment. The name must be unique.
   * Enter the **Image URL** - `runai.jfrog.io/demo/example-triton-server`
   * Set the inference **serving endpoint** to **HTTP** and the container port to `8000`
   * Click **CREATE ENVIRONMENT**

   The newly created environment will be selected automatically.
9. Select the **‘half-gpu’** compute resource for your workload (GPU devices: 1)

   * If ‘half-gpu’ is not displayed in the gallery, follow the steps below:
     * Click **+NEW COMPUTE RESOURCE**
     * Enter half-gpu as the **name** for the compute resource. The name must be unique.
     * Set **GPU devices per pod** - 1
     * Set **GPU memory per device**
       * Select **% (of device) -** Fraction of a GPU device’s memory
       * Set the memory **Request** - 50 (the workload will allocate 50% of the GPU memory)
     * Optional: set the **CPU compute per pod** - 0.1 cores (default)
     * Optional: set the **CPU memory per pod** - 100 MB (default)
     * Click **CREATE COMPUTE RESOURCE**

   The newly created compute resource will be selected automatically.
10. Under **Replica autoscaling**:
    * Set a **minimum** of 1 replica and **maximum** of 2 replicas
    * Set the conditions for creating a new replica to **Concurrency (Requests)** and the **value** to 3
    * Set when the replicas should be automatically scaled down to zero to **After 5 minutes of inactivity**. When automatic scaling to zero is enabled, the minimum number of replicas set in the previous step automatically changes to 0.
11. Click **CREATE INFERENCE**

This starts a Triton Inference Server with a maximum of 2 replicas, each consuming half a GPU.
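
If you also have `kubectl` access, you can optionally watch the workload come up from the command line. A minimal sketch, assuming the default convention in which each project gets a `runai-<project-name>` namespace:

```bash
# Watch the inference server pods start in the project's namespace
# (assumes the default runai-<PROJECT-NAME> namespace naming).
kubectl get pods -n runai-<PROJECT-NAME> -w
```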
{% endtab %}

{% tab title="API" %}
Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the [Inferences](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/workloads/inferences) API:

```bash
curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
    "name": "workload-name",
    "useGivenNameAsPrefix": true,
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "image": "runai.jfrog.io/demo/example-triton-server",
        "servingPort": {
            "protocol": "http",
            "container": 8000
        },
        "autoscaling": {
            "minReplicas": 0,
            "maxReplicas": 2,
            "metric": "concurrency",
            "metricThreshold": 3,
            "scaleToZeroRetentionSeconds": 300
        },
        "compute": {
            "cpuCoreRequest": 0.1,
            "gpuRequestType": "portion",
            "cpuMemoryRequest": "100M",
            "gpuDevicesRequest": 1,
            "gpuPortionRequest": 0.5
        }
    }
}'
```

* `<COMPANY-URL>` - The link to the NVIDIA Run:ai user interface
* `<TOKEN>` - The API access token obtained in [Step 1](#step-1-logging-in)
* `<PROJECT-ID>` - The ID of the Project the workload is running on. You can get the Project ID via the [Get Projects](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/organizations/projects#get-api-v1-org-unit-projects) API.
* `<CLUSTER-UUID>` - The unique identifier of the Cluster. You can get the Cluster UUID via the [Get Clusters](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/organizations/clusters#get-api-v1-clusters) API.
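
For example, both lookups can be performed with the same token; match the entries in the responses to your project and cluster names to find the identifiers:

```bash
# List projects to find <PROJECT-ID> (Get Projects API).
curl -L 'https://<COMPANY-URL>/api/v1/org-unit/projects' \
-H 'Authorization: Bearer <TOKEN>'

# List clusters to find <CLUSTER-UUID> (Get Clusters API).
curl -L 'https://<COMPANY-URL>/api/v1/clusters' \
-H 'Authorization: Bearer <TOKEN>'
```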

{% hint style="info" %}
**Note**

The above API snippet runs only on NVIDIA Run:ai clusters version 2.18 and above.
{% endhint %}
{% endtab %}
{% endtabs %}

## Step 3: Querying the Inference Server

In this step, you'll test the deployed model by sending a request to the inference server. To do this, you'll launch a general-purpose workload, typically a **Training** or **Workspace** workload, to run the Triton demo client. You'll first retrieve the workload address, which serves as the model’s inference serving endpoint. Then, use the client to send a sample request and verify that the model is responding correctly.
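
Once you have the address, you can optionally confirm the server is up before launching the client. Triton exposes a standard readiness endpoint over HTTP, so a quick check (assuming the default KServe v2 protocol served by the demo image) looks like:

```bash
# Returns HTTP 200 when the Triton server is ready to accept inference requests.
curl -v '<INFERENCE-ENDPOINT>/v2/health/ready'
```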

{% tabs %}
{% tab title="UI" %}

1. Go to **Workload manager** → **Workloads**.
2. Click **COLUMNS** and select **Connections**.
3. Select the link under the Connections column for the inference workload created in [Step 2](#step-2-submitting-an-inference-workload).
4. In the **Connections Associated with Workload** form, copy the URL under the **Address** column.
5. Click **+NEW WORKLOAD** and select **Training**.
6. Select the **cluster** and **project** where the inference workload was created.
7. Under **Workload architecture**, select **Standard**.
8. Select **Start from scratch** to launch a new workload quickly.
9. Enter a unique **name** for the workload. If the name already exists in the project, you will be requested to submit a different name.
10. Under **Submission**, select **Original** and click **CONTINUE**.
11. Create an environment for your workload:

    * Click **+NEW ENVIRONMENT**
    * Enter a **name** for the environment. The name must be unique.
    * Enter the **Image URL** - `runai.jfrog.io/demo/example-triton-client`
    * Set the runtime settings for the environment. Click **+COMMAND & ARGUMENTS** and add the following:
      * Enter the command: `perf_analyzer`
      * Enter the arguments: `-m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>`. Make sure to replace the inference endpoint with the **Address** you retrieved above.
    * Click **CREATE ENVIRONMENT**

    The newly created environment will be selected automatically.
12. Select the **‘cpu-only’** compute resource for your workspace

    * If ‘cpu-only’ is not displayed in the gallery, follow the steps below:
      * Click **+NEW COMPUTE RESOURCE**
      * Enter cpu-only as the **name** for the compute resource. The name must be unique.
      * Set **GPU devices** **per pod** - 0
      * Set **CPU compute** **per pod** - 0.1 cores
      * Set the **CPU memory** **per pod** - 100 MB (default)
      * Click **CREATE COMPUTE RESOURCE**

    The newly created compute resource will be selected automatically.
13. Click **CREATE TRAINING**
{% endtab %}

{% tab title="API" %}
Copy the following command to your terminal. Make sure to update the parameters below accordingly. For more details, see the [Trainings](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/workloads/trainings) API:

```bash
curl -L 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
    "name": "workload-name",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "image": "runai.jfrog.io/demo/example-triton-client",
        "command": "perf_analyzer",
        "args": "-m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>",
        "compute": {
            "cpuCoreRequest": 0.1,
            "cpuMemoryRequest": "100M"
        }
    }
}'
```

* `<COMPANY-URL>` - The link to the NVIDIA Run:ai user interface
* `<TOKEN>` - The API access token obtained in [Step 1](#step-1-logging-in)
* `<PROJECT-ID>` - The ID of the Project the workload is running on. You can get the Project ID via the [Get Projects](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/organizations/projects#get-api-v1-org-unit-projects) API.
* `<CLUSTER-UUID>` - The unique identifier of the Cluster. You can get the Cluster UUID via the [Get Clusters](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/organizations/clusters#get-api-v1-clusters) API.
* `<INFERENCE-ENDPOINT>` - You can get the inference endpoint from the urls parameter via the [Get Workloads](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/workloads/workloads#get-api-v1-workloads) API.
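
For example, a sketch of that lookup; the serving address of the inference workload created in [Step 2](#step-2-submitting-an-inference-workload) appears in the `urls` parameter of the matching entry in the response:

```bash
# List workloads and locate the inference workload by name; its serving
# address is returned in the urls parameter (Get Workloads API).
curl -L 'https://<COMPANY-URL>/api/v1/workloads' \
-H 'Authorization: Bearer <TOKEN>'
```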

{% hint style="info" %}
**Note**

The above API snippet runs only on NVIDIA Run:ai clusters version 2.18 and above.
{% endhint %}
{% endtab %}
{% endtabs %}

## Next Steps

* Select the inference workload you created in [Step 2](#step-2-submitting-an-inference-workload) and go to the [Metrics](https://run-ai-docs.nvidia.com/self-hosted/2.20/workloads#metrics) tab to watch the GPU and inference metrics rise as the client sends requests.
* Manage and monitor your newly created workload using the [Workloads](https://run-ai-docs.nvidia.com/self-hosted/2.20/workloads-in-nvidia-run-ai/workloads) table.
