nvidia-smi gives an error inside a Docker container
Sometimes I can't communicate with my Nvidia GPUs inside a Docker container when I come back to my workplace from home, even though the previously launched process that uses the GPUs is still running fine. The running process (training a neural network via PyTorch) is not affected by the disconnection, but I cannot launch a new process.
`nvidia-smi` gives `Failed to initialize NVML: Unknown Error` and `torch.cuda.is_available()` likewise returns `False` (a minimal sketch of both checks follows the list below). I have encountered two different cases:
- `nvidia-smi` works fine when run on the host machine. In this case, the situation can be resolved by restarting the Docker container via `docker stop $MYCONTAINER` followed by `docker start $MYCONTAINER` on the host.
- `nvidia-smi` does not work on the host machine either, throwing `Failed to initialize NVML: Driver/library version mismatch`, and `nvcc --version` fails with `Command 'nvcc' not found, but can be installed with: sudo apt install nvidia-cuda-toolkit`. The strange part is that the current process still runs fine. In this case, reinstalling the driver or rebooting the machine solves the problem.
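For reference, this is roughly how I reproduce the checks from inside the running container (`$MYCONTAINER` stands for my container's name, and the Python call assumes PyTorch is installed in the container):

```bash
# Driver visibility from inside the running container.
docker exec $MYCONTAINER nvidia-smi
# Fails with: Failed to initialize NVML: Unknown Error

# CUDA availability as seen by PyTorch in the same container.
docker exec $MYCONTAINER python -c "import torch; print(torch.cuda.is_available())"
# Prints: False
```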
However, these solutions require stopping all currently running processes, so they are not an option when the current process must not be interrupted.
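Concretely, the recovery steps for the two cases look roughly like this (the driver package name matches my installed 510 driver and may differ on other setups):

```bash
# Case 1: NVML fails only inside the container.
# Restarting the container from the host restores GPU access.
docker stop $MYCONTAINER
docker start $MYCONTAINER

# Case 2: NVML fails on the host as well.
# Either reinstall the driver (package name matches my 510 driver)...
sudo apt install --reinstall nvidia-driver-510
# ...or reboot the machine.
sudo reboot
```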
Does anybody have a suggestion for resolving this situation?
Many thanks.
(Software)
- Docker version: 20.10.14, build a224086
- OS: Ubuntu 22.04
- Nvidia driver version: 510.73.05
- CUDA version: 11.6
(Hardware)
- Supermicro server
- Nvidia A5000 * 8
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0. Source: Stack Overflow.