'Kubernetes GPU Pod error : validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
When trying to create Pods that can use GPU, I get the error "exec: "nvidia-smi": executable file not found in $PATH" ". To explain the error from the beginning, my main goal was to create JupyterHub enviroments that can use GPU. I installed Zero to JupyterHub for Kubernetes. I followed these steps to be able to use GPU. When I check my nodes GPUs seems schedulable by Kubernetes. So far everything seemed fine.
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
NAME GPUs
arge-server 1
However, when I logged in to JupyetHub and tried to open the profile using GPU, I got an error: [Warning] 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. So, I checked the Pods and I found that they were all in the "Waiting: PodInitializing" state.
kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-dcgm-x5rqs 0/1 Init:0/1 2 6d20h
nvidia-device-plugin-daemonset-jhjhb 0/1 Init:0/1 0 6d20h
gpu-feature-discovery-pd4xv 0/1 Init:0/1 2 6d20h
nvidia-dcgm-exporter-7mjgt 0/1 Init:0/1 2 6d20h
nvidia-operator-validator-9xjmv 0/1 Init:Error 10 26m
After that, I took a closer look at the Pod nvidia-operator-validator-9xjmv, which was the beginning of the error, and I saw that the toolkit-validation container was throwing a CrashLoopBackOff error. Here is the relevant part of the log:
kubectl describe pod nvidia-operator-validator-9xjmv -n gpu-operator-resources
Name: nvidia-operator-validator-9xjmv
Namespace: gpu-operator-resources
.
.
.
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
.
.
.
toolkit-validation:
Container ID: containerd://e7d004f0809cbefdae5407ea42eb659972ea7eefa5dd6e45e968cbf3ed22bf2e
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 18 Nov 2021 12:55:00 +0300
Finished: Thu, 18 Nov 2021 12:55:00 +0300
Ready: False
Restart Count: 16
Environment:
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hx7ls (ro)
.
.
.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 58m default-scheduler Successfully assigned gpu-operator-resources/nvidia-operator-validator-9xjmv to arge-server
Normal Pulled 58m kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 58m kubelet Created container driver-validation
Normal Started 58m kubelet Started container driver-validation
Normal Pulled 56m (x5 over 58m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 56m (x5 over 58m) kubelet Created container toolkit-validation
Normal Started 56m (x5 over 58m) kubelet Started container toolkit-validation
Warning BackOff 3m7s (x255 over 58m) kubelet Back-off restarting failed container
Then, I looked at the logs of the container and I got the following error.
kubectl logs -n gpu-operator-resources -f nvidia-operator-validator-9xjmv -c toolkit-validation
time="2021-11-18T09:29:24Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
toolkit is not ready
For similar issues, it was suggested to delete the failed Pod and deployment. However, doing these did not fix my problem. Do you have any suggestions?
I have;
- Ubuntu 20.04
- Kubernetes v1.21.6
- Docker 20.10.10
- NVIDIA-SMI 470.82.01
- CUDA 11.4
- CPU: Intel Xeon E5-2683 v4 (32) @ 2.097GHz
- GPU: NVIDIA GeForce RTX 2080 Ti
- Memory: 13815MiB / 48280MiB
Thanks in advance.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
