PyTorch code stops with message "Killed". What killed it?
I train a network on GPU with PyTorch. However, after at most 3 epochs, the code stops with the message:
Killed
No other error message is given. I monitored memory and GPU usage, and there was still free space during the run. I reviewed /var/sys/dmesg to find a detailed message about this, but no message containing "kill" appeared. What might be the problem?
CUDA version: 9.0, PyTorch version: 1.1.0
Solution 1:[1]
If you have root access, you can check whether this is a memory issue with the dmesg command.
In my case, the process was killed by the kernel due to running out of memory.
The cause turned out to be saving tensors that require grad to a list: each of them keeps its entire computation graph alive, which consumes significant memory.
I fixed the issue by appending a .detach()-ed copy to the list instead of the tensor returned by the loss function.
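The fix above can be sketched as follows. The model, data, and loop are illustrative stand-ins, not the asker's actual code; the point is the last two lines, which contrast keeping the loss tensor (graph and all) with storing a detached scalar:

```python
import torch

# Toy setup standing in for whatever network is being trained
model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for step in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    loss = criterion(model(x), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Bad:  losses.append(loss)
    #       keeps the whole computation graph for this step alive,
    #       so memory grows every iteration until the OOM killer fires.
    # Good: store a detached value (here a plain Python float).
    losses.append(loss.detach().item())
```

Storing `.item()` (or `loss.detach()`) means the list holds only numbers, so graphs from past iterations can be freed immediately.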
Solution 2:[2]
You can run dmesg in your terminal and scroll to the bottom. It will show you the reason the process was killed.
Since you mentioned PyTorch, chances are your process was killed due to "Out of Memory". To resolve this, reduce your batch size until you no longer see the error.
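A quick way to check the kernel log for OOM-killer activity is to filter dmesg for the relevant phrases (the exact wording varies slightly across kernel versions, so the pattern below is a reasonable starting point, not an exhaustive match):

```shell
# Show recent kernel messages from the OOM killer, if any.
# "killed process" and "out of memory" cover the common log formats.
dmesg | grep -iE "killed process|out of memory" | tail -n 5
```

On some systems reading the kernel ring buffer requires root, so you may need `sudo dmesg` instead.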
Hope this helps! :)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Eisuke Mizutani |
| Solution 2 | Risab Biswas |
