Kubernetes GPU Pod error: validating toolkit installation: exec: "nvidia-smi": executable file not found in $PATH

When trying to create Pods that can use the GPU, I get the error "exec: \"nvidia-smi\": executable file not found in $PATH". To explain the error from the beginning: my main goal was to create JupyterHub environments that can use the GPU. I installed Zero to JupyterHub for Kubernetes and followed these steps to enable GPU support. When I check my nodes, the GPU seems schedulable by Kubernetes. So far, everything seemed fine.

kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'

NAME          GPUs
arge-server   1
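
For reference, the same thing shows up in the node's Capacity and Allocatable fields (arge-server is my node, and nvidia.com/gpu is the resource name advertised by the NVIDIA device plugin):

kubectl describe node arge-server | grep -i 'nvidia.com/gpu'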

However, when I logged in to JupyterHub and tried to open the profile that uses the GPU, I got the error [Warning] 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. So I checked the Pods in the gpu-operator-resources namespace and found that they were all stuck in the "Waiting: PodInitializing" state.

kubectl get pods -n gpu-operator-resources

NAME                                   READY   STATUS       RESTARTS   AGE
nvidia-dcgm-x5rqs                      0/1     Init:0/1     2          6d20h
nvidia-device-plugin-daemonset-jhjhb   0/1     Init:0/1     0          6d20h
gpu-feature-discovery-pd4xv            0/1     Init:0/1     2          6d20h
nvidia-dcgm-exporter-7mjgt             0/1     Init:0/1     2          6d20h
nvidia-operator-validator-9xjmv        0/1     Init:Error   10         26m

After that, I took a closer look at the Pod nvidia-operator-validator-9xjmv, which is where the error starts, and saw that its toolkit-validation init container was in CrashLoopBackOff. Here is the relevant part of the describe output:

kubectl describe pod nvidia-operator-validator-9xjmv -n gpu-operator-resources

    Name:                 nvidia-operator-validator-9xjmv
    Namespace:            gpu-operator-resources
        .   
        .
        .
    Controlled By:  DaemonSet/nvidia-operator-validator
    Init Containers:
        .
        .
        .
      toolkit-validation:
        Container ID:  containerd://e7d004f0809cbefdae5407ea42eb659972ea7eefa5dd6e45e968cbf3ed22bf2e
        Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
        Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
        Port:          <none>
        Host Port:     <none>
        Command:
          sh
          -c
        Args:
          nvidia-validator
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       Error
          Exit Code:    1
          Started:      Thu, 18 Nov 2021 12:55:00 +0300
          Finished:     Thu, 18 Nov 2021 12:55:00 +0300
        Ready:          False
        Restart Count:  16
        Environment:
          WITH_WAIT:  false
          COMPONENT:  toolkit
        Mounts:
          /run/nvidia/validations from run-nvidia-validations (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hx7ls (ro)
        .   
        .
        .
    
    Events:
      Type     Reason     Age                   From               Message
      ----     ------     ----                  ----               -------
      Normal   Scheduled  58m                   default-scheduler  Successfully assigned gpu-operator-resources/nvidia-operator-validator-9xjmv to arge-server
      Normal   Pulled     58m                   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
      Normal   Created    58m                   kubelet            Created container driver-validation
      Normal   Started    58m                   kubelet            Started container driver-validation
      Normal   Pulled     56m (x5 over 58m)     kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
      Normal   Created    56m (x5 over 58m)     kubelet            Created container toolkit-validation
      Normal   Started    56m (x5 over 58m)     kubelet            Started container toolkit-validation
      Warning  BackOff    3m7s (x255 over 58m)  kubelet            Back-off restarting failed container

Then I looked at the logs of that container and got the following error:

kubectl logs -n gpu-operator-resources -f nvidia-operator-validator-9xjmv -c toolkit-validation

time="2021-11-18T09:29:24Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
toolkit is not ready
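
Note that nvidia-smi works when run directly on the host (that is where the driver version listed at the end of this post comes from); it is only inside the validator container that it cannot be found:

# on the host itself, outside of any container
which nvidia-smi
nvidia-smi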

For similar issues, it was suggested to delete the failed Pod and its owning workload so that they get recreated; roughly what I tried is shown below, but it did not fix my problem. Do you have any suggestions?
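
# Pod name from above; the DaemonSet name comes from the "Controlled By" field
kubectl delete pod nvidia-operator-validator-9xjmv -n gpu-operator-resources
kubectl rollout restart daemonset nvidia-operator-validator -n gpu-operator-resources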

I have:

  • Ubuntu 20.04
  • Kubernetes v1.21.6
  • Docker 20.10.10
  • NVIDIA-SMI 470.82.01
  • CUDA 11.4
  • CPU: Intel Xeon E5-2683 v4 (32) @ 2.097GHz
  • GPU: NVIDIA GeForce RTX 2080 Ti
  • Memory: 13815MiB / 48280MiB

Thanks in advance.


