Htop CPU bar red, 100% kernel time
I found some similar topics, but none of them offered a helpful solution. Since I have some more information to provide, I opened this question.
My PyTorch script frequently gets stuck on a training server.
Htop shows only one green CPU bar, while the other active cores are almost 100% red. According to htop's F1 help screen, red means kernel time.
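To put a number on what the red bars show, here is a minimal sketch that samples `/proc/stat` twice and reports the share of time each core spent in kernel mode. It assumes a Linux system and treats "kernel time" as the `system`, `irq`, and `softirq` counters combined, which may not match htop's accounting exactly. The function names (`parse_proc_stat`, `kernel_share`, `sample`) are hypothetical helpers, not part of any library.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: [jiffy counters]} for the per-core 'cpuN' lines."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate 'cpu' line and non-CPU lines such as 'intr' or 'ctxt'.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            counters[fields[0]] = [int(v) for v in fields[1:]]
    return counters


def kernel_share(before, after):
    """Percentage of elapsed time per core spent in kernel mode between two samples."""
    shares = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        # Pad so that older kernels with fewer columns do not raise IndexError.
        delta += [0] * (8 - len(delta))
        # Columns: user, nice, system, idle, iowait, irq, softirq, steal.
        # guest and guest_nice are already counted in user and nice, so they are left out.
        total = sum(delta[:8])
        if total <= 0:
            continue
        kernel = delta[2] + delta[5] + delta[6]  # assumption: system + irq + softirq
        shares[name] = 100.0 * kernel / total
    return shares


def sample(interval=1.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart, and compare them."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return kernel_share(before, after)
```

Calling `sample()` on an affected server while training is stuck, and again while it runs normally, would give per-core percentages that can be compared between the two states and against the servers where the problem never occurs.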

Whenever these 100% red CPU bars appear, the training gets stuck and GPU utilization drops to 0%. The weird thing is that this only happens on two of the servers I use. It never happens on my less powerful PC, and it never happens on another powerful server.
The strace command shows that when the problem occurs, there are many calls like this:
futex(0x55bbb0e82db0, FUTEX_WAKE_PRIVATE, 1) = 0
Can anyone explain what the problem is and how to avoid it? Is there any further information I should provide?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow

