Can't run GPU pod - 0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}
I am trying to create a GPU node in my Azure cluster, following these instructions: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster
I already had a K8s cluster, so I added a new node pool:
az aks nodepool add \
--resource-group XXX \
--cluster-name XXX \
--name spotgpu \
--node-vm-size standard_nv12s_v3 \
--node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true \
--enable-cluster-autoscaler \
--node-count 1 \
--min-count 1 \
--max-count 2 \
--max-pods 12 \
--priority Spot \
--eviction-policy Delete \
--spot-max-price 0.2
The node pool was created successfully:
kubectl get nodes
NAME                  STATUS   ROLES   AGE   VERSION
...
aks-spotgpu-XXX-XXX   Ready    agent   11m   v1.21.9
After that, I applied the Job from this example: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#run-a-gpu-enabled-workload
But the new pod can't run; it stays in the Pending state:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 76s default-scheduler 0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 8 Insufficient nvidia.com/gpu.
Warning FailedScheduling 75s default-scheduler 0/12 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 8 Insufficient nvidia.com/gpu.
Normal NotTriggerScaleUp 39s cluster-autoscaler pod didn't trigger scale-up: 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {sku: compute-cpu}, that the pod didn't tolerate, 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate
I tried different combinations of max, min, and node count, but I always get the same warning messages and can't start the pod.
Where am I going wrong?
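Reading the events, the GPU node appears to be one of the two spot nodes rejected for the taint kubernetes.azure.com/scalesetpriority: spot, which AKS adds to Spot node pools on top of the sku=gpu taint set on the command line. Below is a minimal sketch of the tolerations the Job's pod template would then need. It assumes the spot taint's effect is NoSchedule (the events do not show the effect) and that the rest of the Job manifest matches the linked example:

```yaml
# Fragment of the Job's pod template (spec.template.spec); a sketch, not a full manifest.
tolerations:
  # Matches the taint set with --node-taints sku=gpu:NoSchedule
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  # Assumption: the spot pool carries this taint, as reported in the events
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
```

With both tolerations present, the scheduler should be able to consider the spotgpu node. The CriticalAddonsOnly and compute-cpu taints belong to other node pools and need no toleration for this workload.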
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
