nvidia-smi gives an error inside a Docker container

  • Sometimes I can't communicate with my Nvidia GPUs inside a Docker container when I return to my workplace after being away, even though a previously launched process that uses the GPUs is still running. The running process (training a neural network via PyTorch) is not affected by the disconnection, but I cannot launch a new process.

  • nvidia-smi gives Failed to initialize NVML: Unknown Error, and torch.cuda.is_available() likewise returns False.

  • I have encountered two different cases:

    1. nvidia-smi works fine on the host machine. In this case, the situation can be resolved by restarting the Docker container via docker stop $MYCONTAINER followed by docker start $MYCONTAINER on the host machine (see the sketch after this list).
    2. nvidia-smi does not work on the host machine either, and nvcc --version also fails, throwing Failed to initialize NVML: Driver/library version mismatch and Command 'nvcc' not found, but can be installed with: sudo apt install nvidia-cuda-toolkit errors. The strange point is that the current process still runs well. In this case, reinstalling the driver or rebooting the machine solves the problem.
  • However, these solutions require stopping all current processes, so they are not an option when I must not interrupt the process that is already running.
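
For completeness, here is a minimal sketch of how I check the GPU state and apply the case-1 restart. MYCONTAINER is a placeholder for the real container name, and the restart function must be run on the host machine, not inside the container:

    import subprocess

    import torch

    def nvml_is_healthy() -> bool:
        """Return True if NVML-based tools work from where this runs."""
        try:
            # When the container loses the GPUs, nvidia-smi prints
            # "Failed to initialize NVML: Unknown Error" and exits non-zero.
            result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        except FileNotFoundError:
            return False
        return result.returncode == 0 and torch.cuda.is_available()

    def restart_container(name: str = "MYCONTAINER") -> None:
        """Case-1 fix: stop and start the container from the HOST machine.

        Note that this kills any process running inside the container,
        which is exactly what I want to avoid.
        """
        subprocess.run(["docker", "stop", name], check=True)
        subprocess.run(["docker", "start", name], check=True)

    if __name__ == "__main__":
        print("GPUs reachable" if nvml_is_healthy() else "NVML broken here")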

Does somebody have a suggestion for solving this situation?

Many thanks.

(Software)

  • Docker version: 20.10.14, build a224086
  • OS: Ubuntu 22.04
  • Nvidia driver version: 510.73.05
  • CUDA version: 11.6

(Hardware)

  • Supermicro server
  • Nvidia A5000 * 8

  • (pic1) nvidia-smi not working inside a Docker container, but working well on the host machine.

  • (pic2) nvidia-smi works after restarting the Docker container, which is case 1 mentioned above.


