Policy YAML Reference
A workload policy is an end-to-end solution for AI managers and administrators to control and simplify how workloads are submitted, setting best practices, enforcing limitations, and standardizing processes for AI projects within their organization.
This article explains the policy YAML fields and the possible rules and defaults that can be set for each field.
Policy YAML Fields - Reference Table
The policy fields are structured in a similar format to the workload API fields. The following tables list the fields, descriptions, defaults, and rules for each workload type, to help you understand and configure policies in YAML format.
Click the link to view the value type of each field.
args
When set, contains the arguments sent along with the command. These override the entry point of the image in the created workload
Workspace
Standard training
Distributed training
command
A command to serve as the entry point of the container running the workspace
Workspace
Standard training
Distributed training
createHomeDir
Instructs the system to create a temporary home directory for the user within the container. Data stored in this directory is not saved when the container exits. When the runAsUser flag is set to true, this flag defaults to true as well.
Workspace
Standard training
Distributed training
environmentVariables
Set of environment variables to populate into the container running the workspace
Workspace
Standard training
Distributed training
image
Specifies the image to use when creating the container running the workload
Workspace
Standard training
Distributed training
imagePullPolicy
Specifies the pull policy of the image when starting a container running the created workload. Options are: always, ifNotPresent, or never
Workspace
Standard training
Distributed training
workingDir
Container’s working directory. If not specified, the container runtime default is used, which might be configured in the container image
Workspace
Standard training
Distributed training
nodeType
Nodes (machines) or a group of nodes on which the workload runs
Workspace
Standard training
Distributed training
nodePools
A prioritized list of node pools for the scheduler to run the workspace on. The scheduler always tries to use the first node pool before moving to the next one when the first is not available.
Workspace
Standard training
Distributed training
annotations
Set of annotations to populate into the container running the workspace
Workspace
Standard training
Distributed training
labels
Set of labels to populate into the container running the workspace
Workspace
Standard training
Distributed training
terminateAfterPreemption
Indicates whether the system should terminate the job after it has been preempted
Workspace
Standard training
Distributed training
autoDeletionTimeAfterCompletionSeconds
Specifies the duration after which a finished workload (Completed or Failed) is automatically deleted. If this field is set to zero, the workload becomes eligible to be deleted immediately after it finishes.
Workspace
Standard training
Distributed training
backoffLimit
Specifies the number of retries before marking a workload as failed
Workspace
Standard training
Distributed training
cleanPodPolicy
Specifies which pods will be deleted when the workload reaches a terminal state (completed/failed). The policy can be one of the following values:
Running - Only pods still running when a job completes (for example, parameter servers) are deleted immediately. Completed pods are not deleted, so that their logs are preserved. (Default)
All - All pods (including completed ones) are deleted immediately when the job finishes.
None - No pods are deleted when the job completes. Pods that are still running keep consuming GPU, CPU, and memory over time. It is recommended to set None only for debugging and for obtaining logs from running pods.
Distributed training
completions
Used with Hyperparameter Optimization. Specifies the number of successful pods the job should reach to be completed. The job is marked as successful once the specified number of pods has succeeded.
Standard training
parallelism
Used with Hyperparameter Optimization. Specifies the maximum desired number of pods the workload should run at any given time.
Standard training
exposedUrls
Specifies a set of exposed URLs (e.g. ingress) from the container running the created workload.
Workspace
Standard training
Distributed training
relatedUrls
Specifies a set of URLs related to the workload. For example, a URL to an external server providing statistics or logging about the workload.
Workspace
Standard training
Distributed training
PodAffinitySchedulingRule
Indicates whether the Pod affinity rule is applied as the “hard” (required) or the “soft” (preferred) option. This field can be specified only if PodAffinity is set to true.
Workspace
Standard training
Distributed training
podAffinityTopology
Specifies the Pod Affinity Topology to be used for scheduling the job. This field can be specified only if PodAffinity is set to true.
Workspace
Standard training
Distributed training
ports
Specifies a set of ports exposed from the container running the created workload. More information in Ports fields below.
Workspace
Standard training
Distributed training
probes
Specifies the ReadinessProbe to use to determine if the container is ready to accept traffic. More information in Probes fields below
Workspace
Standard training
Distributed training
tolerations
Toleration rules that apply to the pods running the workload. Toleration rules guide (but do not require) the system as to which node each pod can be scheduled to or evicted from, based on matching those rules against the set of taints defined for each Kubernetes node.
Workspace
Standard training
Distributed training
priorityClass
Priority class of the workload. The values for workspace are build (default) or interactive-preemptible. For training only, use train. Enum: "build", "train", "interactive-preemptible"
Workspace
storage
Contains all the fields related to storage configurations. More information in Storage fields below.
Workspace
Standard training
Distributed training
security
Contains all the fields related to security configurations. More information in Security fields below.
Workspace
Standard training
Distributed training
compute
Contains all the fields related to compute configurations. More information in Compute fields below.
Workspace
Standard training
Distributed training
tty
Whether this container should allocate a TTY for itself. Also requires 'stdin' to be true.
Workspace
Standard training
Distributed training
stdin
Whether this container should allocate a buffer for stdin in the container runtime. If this is not set, reads from stdin in the container will always result in EOF
Workspace
Standard training
Distributed training
numWorkers
The number of workers that will be allocated for running the workload.
Distributed training
distributedFramework
The distributed training framework used in the workload.
Enum: "MPI", "PyTorch", "TF", "XGBoost"
Distributed training
slotsPerWorker
Specifies the number of slots per worker used in hostfile. Defaults to 1. (applicable only for MPI)
Distributed training (MPI only)
minReplicas
The lower limit for the number of worker pods to which the training job can scale down. (applicable only for PyTorch)
Distributed training (PyTorch only)
maxReplicas
The upper limit for the number of worker pods that can be set by the autoscaler. Cannot be smaller than minReplicas. (applicable only for PyTorch)
Distributed training (PyTorch only)
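As an illustration of how some of the fields above can appear in a policy, the following is a minimal sketch rather than a definitive configuration. It assumes the defaults and rules layout described in the Policy Spec Sections of this article, that backoffLimit and autoDeletionTimeAfterCompletionSeconds are integer fields, and that createHomeDir is a Boolean field. All values are hypothetical.

```yaml
# Hypothetical policy sketch; all values are illustrative only.
defaults:
  imagePullPolicy: Always      # used when the submission does not set a pull policy
  createHomeDir: true          # assumed Boolean field
  backoffLimit: 3              # assumed Integer field: retries before marking the workload as failed
  autoDeletionTimeAfterCompletionSeconds: 3600   # assumed Integer field: delete one hour after finishing
rules:
  image:
    required: true             # every workload must end up with an image value
  backoffLimit:
    max: 6                     # submissions cannot request more than 6 retries
```

In this sketch the defaults fill in values that a submission omits, while the rules constrain what a submission is allowed to set.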
Ports Fields
container
The port that the container running the workload exposes.
Workspace
Standard training
Distributed training
serviceType
Specifies the default service exposure method for ports. The default is used for ports that do not specify a service type. Options are: LoadBalancer, NodePort or ClusterIP. For more information see the External Access to Containers guide.
Workspace
Standard training
Distributed training
external
The external port which allows a connection to the container port. If not specified, the port is auto-generated by the system.
Workspace
Standard training
Distributed training
toolName
A name describing the tool that runs on this port.
Workspace
Standard training
Distributed training
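The port fields above might be combined in a workload spec as sketched here. The sketch assumes that ports is a list under spec (following the spec examples elsewhere in this article) and that each item carries the four attributes from this table. The port numbers and the tool name are hypothetical.

```yaml
# Hypothetical workload spec fragment; port numbers and tool name are illustrative.
spec:
  image: ubuntu
  ports:
    - container: 8888          # port that the container exposes
      serviceType: NodePort    # one of LoadBalancer, NodePort or ClusterIP
      external: 30088          # omit this attribute to let the system auto-generate the port
      toolName: jupyter        # free-text name of the tool running on this port
```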
Probes Fields
readiness
Specifies the Readiness Probe to use to determine if the container is ready to accept traffic.
Workspace
Standard training
Distributed training
Security Fields
uidGidSource
Indicates the way to determine the user and group IDs of the container. The options are:
fromTheImage - user and group IDs are determined by the docker image that the container runs. This is the default option.
custom - user and group IDs can be specified in the environment asset and/or the workspace creation request.
idpToken - user and group IDs are determined according to the identity provider (idp) access token. This option is intended for internal use of the environment UI form. For more information, see User identity in containers.
Workspace
Standard training
Distributed training
capabilities
The capabilities field allows adding a set of Unix capabilities to the container running the workload. Capabilities are distinct Linux privileges, traditionally associated with the superuser, that can be independently enabled and disabled.
Workspace
Standard training
Distributed training
seccompProfileType
Indicates which kind of seccomp profile is applied to the container. The options are:
RuntimeDefault - the container runtime default profile should be used
Unconfined - no profile should be applied
Workspace
Standard training
Distributed training
runAsNonRoot
Indicates that the container must run as a non-root user.
Workspace
Standard training
Distributed training
readOnlyRootFilesystem
If true, mounts the container's root filesystem as read-only.
Workspace
Standard training
Distributed training
runAsUid
Specifies the Unix user id with which the container running the created workload should run.
Workspace
Standard training
Distributed training
runAsGid
Specifies the Unix Group ID with which the container should run.
Workspace
Standard training
Distributed training
supplementalGroups
Comma-separated list of groups that the user running the container belongs to, in addition to the group indicated by runAsGid.
Workspace
Standard training
Distributed training
allowPrivilegeEscalation
Allows the container running the workload and all launched processes to gain additional privileges after the workload starts
Workspace
Standard training
Distributed training
hostIpc
Whether to enable hostIpc. Defaults to false.
Workspace
Standard training
Distributed training
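A policy could use the security fields above to enforce non-root execution. The following is a minimal sketch under the assumption that these fields sit under a security key in both the defaults and rules sections, as in the examples in the Policy Spec Sections of this article. The specific values are hypothetical.

```yaml
# Hypothetical security policy sketch; values are illustrative only.
defaults:
  security:
    runAsNonRoot: true               # containers run as a non-root user unless overridden
    allowPrivilegeEscalation: false  # processes cannot gain additional privileges
    hostIpc: false                   # matches the documented default
rules:
  security:
    runAsNonRoot:
      canEdit: false                 # submissions cannot change the default above
    runAsUid:
      min: 500                       # user IDs below 500 are not allowed
```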
Compute Fields
cpuCoreRequest
CPU units to allocate for the created workload (0.5, 1, etc.). The workload receives at least this amount of CPU. Note that the workload is not scheduled unless the system can guarantee this amount of CPU to the workload.
Workspace
Standard training
Distributed training
cpuCoreLimit
Limit on the number of CPUs consumed by the workload (0.5, 1, etc.). The system guarantees that this workload is not able to consume more than this amount of CPU.
Workspace
Standard training
Distributed training
cpuMemoryRequest
The amount of CPU memory to allocate for this workload (1G, 20M, etc.). The workload receives at least this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of memory to the workload.
Workspace
Standard training
Distributed training
cpuMemoryLimit
Limit on the CPU memory to allocate for this workload (1G, 20M, etc.). The system guarantees that this workload is not able to consume more than this amount of memory. The workload receives an error when trying to allocate more memory than this limit.
Workspace
Standard training
Distributed training
largeShmRequest
A large /dev/shm device to mount into a container running the created workload (shm is a shared file system mounted on RAM).
Workspace
Standard training
Distributed training
gpuRequestType
Sets the unit type for GPU resource requests to either portion, memory or MIG profile. The request type can be stated as portion, memory or migProfile only if gpuDeviceRequest = 1.
Workspace
Standard training
Distributed training
migProfile (Deprecated)
Specifies the memory profile to be used for workload running on NVIDIA Multi-Instance GPU (MIG) technology.
Workspace
Standard training
Distributed training
gpuPortionRequest
Specifies the fraction of GPU to be allocated to the workload, between 0 and 1. For backward compatibility, it also supports the number of gpuDevices larger than 1, currently provided using the gpuDevices field.
Workspace
Standard training
Distributed training
gpuDeviceRequest
Specifies the number of GPUs to allocate for the created workload. The gpuRequestType can be defined only if gpuDeviceRequest = 1.
Workspace
Standard training
Distributed training
gpuPortionLimit
When a fraction of a GPU is requested, the GPU limit specifies the portion limit to allocate to the workload. The range of the value is from 0 to 1.
Workspace
Standard training
Distributed training
gpuMemoryRequest
Specifies GPU memory to allocate for the created workload. The workload receives this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of GPU memory to the workload.
Workspace
Standard training
Distributed training
gpuMemoryLimit
Specifies a limit on the GPU memory to allocate for this workload. Should be no less than the gpuMemory.
Workspace
Standard training
Distributed training
extendedResources
Specifies values for extended resources. Extended resources are third-party devices (such as high-performance NICs, FPGAs, or InfiniBand adapters) that you want to allocate to your Job.
Workspace
Standard training
Distributed training
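The compute fields above can be given defaults and limits in the same way as other policy fields. The sketch below is illustrative only. It assumes that cpuCoreRequest and gpuPortionRequest are Number fields and that the memory fields are Quantity fields, and it uses only rule types listed for those value types in the Value Types section of this article. All values are hypothetical.

```yaml
# Hypothetical compute policy sketch; values are illustrative only.
defaults:
  compute:
    cpuCoreRequest: 0.5        # assumed Number field
    cpuMemoryRequest: 1G       # assumed Quantity field
    gpuPortionRequest: 0.5     # fraction of a GPU, between 0 and 1
rules:
  compute:
    cpuCoreRequest:
      min: 0.25                # submissions must request at least a quarter of a core
    cpuMemoryLimit:
      max: 20G                 # submissions cannot set a memory limit above 20G
    gpuPortionRequest:
      canEdit: false           # submissions cannot change the default portion
```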
Storage Fields
dataVolume
Set of data volumes to use in the workload. Each data volume is mapped to a file-system mount point within the container running the workload.
Workspace
Standard training
Distributed training
Maps a folder to a file-system mount point within the container running the workload.
Workspace
Standard training
Distributed training
Details of the git repository and items mapped to it.
Workspace
Standard training
Distributed training
Specifies persistent volume claims to mount into a container running the created workload.
Workspace
Standard training
Distributed training
Specifies NFS volume to mount into the container running the workload.
Workspace
Standard training
Distributed training
Specifies S3 buckets to mount into the container running the workload.
Workspace
Standard training
Distributed training
configMapVolumes
Specifies ConfigMaps to mount as volumes into a container running the created workload.
Workspace
Standard training
Distributed training
secretVolume
Set of secret volumes to use in the workload. A secret volume maps a secret resource in the cluster to a file-system mount point within the container running the workload.
Workspace
Standard training
Distributed training
Storage Field Examples
Value Types
Each field has a specific value type. The following value types are supported.
| Value type | Description | Supported rule types | Example value |
| --- | --- | --- | --- |
| Boolean | A binary value that can be either True or False | canEdit, required | true/false |
| String | A sequence of characters used to represent text. It can include letters, numbers, symbols, and spaces | canEdit, required, options | abc |
| Itemized | An ordered collection of items (objects). The item type differs from field to field, but all items in a given list are of the same type. For further information, see the Itemized section below the table. | canAdd, locked | See below |
| Integer | A whole number without a fractional component | canEdit, required, min, max, step, defaultFrom | 100 |
| Number | A number capable of having non-integer values | canEdit, required, min, defaultFrom | 10.3 |
| Quantity | A string composed of a number and a unit, representing a quantity | canEdit, required, min, max, defaultFrom | 5M |
| Array | A set of values that are treated as one, as opposed to Itemized, in which each item can be referenced separately | canEdit, required | node-a, node-b, node-c |
Itemized
Workload fields of type itemized hold multiple instances. Unlike plain objects, each instance can be referenced by a key field. The key field is defined for each itemized field.
Consider the following workload spec:
spec:
  image: ubuntu
  compute:
    extendedResources:
      - resource: added/cpu
        quantity: 10
      - resource: added/memory
        quantity: 20M
In this example, extendedResources has two instances, each with two attributes: resource (the key attribute) and quantity.
In a policy, the defaults and rules for itemized fields have two subsections:
Instances: default items to be added to the policy or rules which apply to an instance as a whole.
Attributes: defaults for attributes within an item or rules which apply to attributes within each item.
Consider the following example:
defaults:
  compute:
    extendedResources:
      instances:
        - resource: default/cpu
          quantity: 5
        - resource: default/memory
          quantity: 4M
      attributes:
        quantity: 3
rules:
  compute:
    extendedResources:
      instances:
        locked:
          - default/cpu
      attributes:
        quantity:
          required: true
Assume the following workload submission is requested:
spec:
  image: ubuntu
  compute:
    extendedResources:
      - resource: default/memory
        exclude: true
      - resource: added/cpu
      - resource: added/memory
        quantity: 5M
The effective policy for the above mentioned workload has the following extendedResources instances:
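| Resource | Source of the instance | Quantity | Source of the quantity attribute |
| --- | --- | --- | --- |
| default/cpu | Policy defaults | 5 | The default of this instance in the policy defaults section |
| added/cpu | Submission request | 3 | The default of the quantity attribute from the attributes section |
| added/memory | Submission request | 5M | Submission request |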
A workload submission request cannot exclude the default/cpu resource, as this key is included in the locked rules under the instances section.
Rule Types
canAdd
Whether the submission request can add items to an itemized field other than those listed in the policy defaults for this field.
locked
Set of items that the workload is unable to modify or exclude. In this example, the workload policy provides defaults for HOME and USER, which the submission request cannot modify or exclude from the workload.
canEdit
Whether the submission request can modify the policy default for this field. In this example, it is assumed that the policy has default for imagePullPolicy. As canEdit is set to false, submission requests are not able to alter this default.
required
When set to true, the workload must have a value for this field. The value can be obtained from policy defaults. If no value is specified in the policy defaults, a value must be specified for this field in the submission request.
step
The allowed gap between values for this field. In this example the allowed values are: 1, 3, 5, 7
Rule Type Examples
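The rule types described above can be sketched in a single rules section, as shown here. This is a hedged illustration and not taken from an actual policy: it assumes that environmentVariables is an itemized field keyed by variable name, that image accepts the required rule, and that gpuDevicesRequest (the name used in the Rules example of this article) is an integer field. The HOME and USER keys, the imagePullPolicy field, and the allowed values 1, 3, 5, 7 follow the descriptions above.

```yaml
# Hypothetical rules sketch covering each rule type; field choices are assumptions.
rules:
  imagePullPolicy:
    canEdit: false             # canEdit: submissions cannot alter the policy default
  image:
    required: true             # required: a value must come from defaults or from the submission
  environmentVariables:
    instances:
      locked:                  # locked: these items cannot be modified or excluded
        - HOME
        - USER
  compute:
    extendedResources:
      instances:
        canAdd: false          # canAdd: submissions cannot add items beyond the policy defaults
    gpuDevicesRequest:
      min: 1
      max: 7
      step: 2                  # step: with min 1 and max 7, the allowed values are 1, 3, 5, 7
```

The step rule is shown together with min and max because the list of allowed values (1, 3, 5, 7) only follows from the gap once both bounds are known.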
Policy Spec Sections
For each field of a specific policy, you can specify both rules and defaults. A policy spec consists of the following sections:
Rules
Defaults
Imposed Assets
Rules
Rules set up constraints on workload policy fields. For example, consider the following policy:
rules:
  compute:
    gpuDevicesRequest:
      max: 8
  security:
    runAsUid:
      min: 500
Such a policy restricts the maximum value of gpuDevicesRequest to 8, and the minimum value of runAsUid, provided in the security section, to 500.
Defaults
The defaults section is used for providing defaults for various workload fields. For example, consider the following policy:
defaults:
  imagePullPolicy: Always
  security:
    runAsNonRoot: true
    runAsUid: 500
Assume a submission request with the following values:
image: ubuntu
security.runAsUid: 501
The effective workload that runs has the following set of values:
| Field | Value | Source |
| --- | --- | --- |
| image | ubuntu | Submission request |
| imagePullPolicy | Always | Policy defaults |
| security.runAsNonRoot | true | Policy defaults |
| security.runAsUid | 501 | Submission request |
Imposed Assets
Default instances of a storage field can be provided using a datasource containing the details of this storage instance. To add such instances in the policy, specify those asset IDs in the imposedAssets section of the policy.
defaults: null
rules: null
imposedAssets:
- f12c965b-44e9-4ff6-8b43-01d8f9e630cc
Assets with references to credential assets (for example: private S3, containing reference to an AccessKey asset) cannot be used as imposedAssets.