Can't run GPU pod - 0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}

I am trying to create a GPU node in my Azure (AKS) cluster. I am following these instructions - https://docs.microsoft.com/en-us/azure/aks/gpu-cluster

I already had a K8s cluster, so I added a new node pool:

az aks nodepool add \
--resource-group XXX \
--cluster-name XXX \
--name spotgpu \
--node-vm-size standard_nv12s_v3 \
--node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true \
--enable-cluster-autoscaler \
--node-count 1 \
--min-count 1 \
--max-count 2 \
--max-pods 12 \
--priority Spot \
--eviction-policy Delete \
--spot-max-price 0.2

The node pool was created successfully:

kubectl get nodes
NAME                                 STATUS   ROLES   AGE   VERSION
...
aks-spotgpu-XXX-XXX      Ready    agent   11m   v1.21.9

After that, I applied this Job - https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#run-a-gpu-enabled-workload

But the new pod can't run; it stays in the Pending state:

Events:
Type     Reason             Age   From                Message
----     ------             ----  ----                -------
Warning  FailedScheduling   76s   default-scheduler   0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 8 Insufficient nvidia.com/gpu.
Warning  FailedScheduling   75s   default-scheduler   0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 8 Insufficient nvidia.com/gpu.
Normal   NotTriggerScaleUp  39s   cluster-autoscaler  pod didn't trigger scale-up: 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate
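
To make the scheduler message easier to read, here is a minimal Python sketch that tallies the rejection reasons per node. The function name `summarize` and the regular expressions are my own assumptions about the message layout, based only on the event text above; this is not an official Kubernetes parser.

```python
import re

# Scheduler message copied from the first FailedScheduling event above.
MESSAGE = (
    "0/12 nodes are available: "
    "1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, "
    "1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, "
    "2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, "
    "that the pod didn't tolerate, "
    "8 Insufficient nvidia.com/gpu."
)


def summarize(message):
    """Return (total_nodes, {reason: node_count}) for a FailedScheduling message.

    Hypothetical helper: it assumes the message layout seen in the event above.
    """
    total = int(re.match(r"\d+/(\d+) nodes are available", message).group(1))
    reasons = {}
    # Untolerated taints look like: "<n> node(s) had taint {key: value}"
    for count, taint in re.findall(r"(\d+) node\(s\) had taint \{([^}]*)\}", message):
        reasons["taint " + taint] = int(count)
    # Resource shortfalls look like: "<n> Insufficient <resource>"
    for count, resource in re.findall(r"(\d+) Insufficient ([\w./-]+)", message):
        reasons["insufficient " + resource.rstrip(".")] = int(count)
    return total, reasons


total, reasons = summarize(MESSAGE)
for reason, count in reasons.items():
    print(f"{count:>2}  {reason}")
print("accounted for:", sum(reasons.values()), "of", total)
```

The counts (1 + 1 + 2 + 8) cover all 12 nodes, so every node is rejected either for a taint the pod does not tolerate or for having no free nvidia.com/gpu. The two nodes with the kubernetes.azure.com/scalesetpriority: spot taint may be worth a closer look, since the new pool was created with --priority Spot.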

I tried different min/max/node count values, but I always get the same warning messages and can't start the pod.

What am I doing wrong?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
