Unable to attach or mount volumes on pods

We noticed that these errors started after the node pool was autoscaled and the existing Nodes were replaced with new compute instances. It also happened during a maintenance window. We're using an NFS server. The GKE cluster version is 1.21.6.

The issue appears to affect only certain Nodes on the cluster. We've cordoned the Nodes where the mount errors occur, and pods on the "healthy" Nodes are working.

"Unable to attach or mount volumes: unmounted volumes=[vol],
unattached volumes=[vol]: timed out waiting for the condition"

We're also seeing errors on the konnectivity-agent:

"connection read failure" err="read tcp
10.4.2.34:43682->10.162.0.119:10250: use of closed network connection"

We believe the issue occurs when autoscaling is enabled and new Nodes are introduced to the pool, but it appears to be completely random: sometimes the pods come up fine, and other times they get the mount error.
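To narrow down which Nodes are affected, the pod events and node state can be checked with commands along these lines (the pod and namespace names below are placeholders):

```shell
# List pods that are not Running, together with the Node each was scheduled on
kubectl get pods --all-namespaces --field-selector=status.phase!=Running -o wide

# Inspect the events of a failing pod for the mount timeout
# (pod and namespace names are examples)
kubectl describe pod my-app-pod -n my-namespace

# Confirm which Nodes are cordoned (shown as SchedulingDisabled)
kubectl get nodes
```

Comparing the Node column of the failing pods against the cordoned Nodes should confirm whether the error is confined to the replaced compute instances.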



Solution 1:[1]

This error indicates that the NFS workload is stuck in a Terminating state. Some disk throttling might also be observed on the worker nodes.

Solution:

There are two possible workarounds for this issue.

  • Force deletion of the NFS workload can sometimes mitigate the issue. After deletion, you may also need to restart the kubelet on the worker node.

  • NFS versions v4.1 and v4.2 shouldn't be affected by this issue. The NFS version is specified via configuration and doesn't require an image change.

Please change the NFS version as shown below:

mountOptions:
  - nfsvers=4.2
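For context, `mountOptions` is set on the PersistentVolume (or on the StorageClass for dynamically provisioned volumes). A minimal sketch, where the volume name, capacity, server address, and export path are all placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv                  # placeholder name
spec:
  capacity:
    storage: 10Gi               # placeholder size
  accessModes:
    - ReadWriteMany
  mountOptions:
    - nfsvers=4.2               # force NFS v4.2 instead of the problematic v4.0
  nfs:
    server: nfs-server.example.internal   # placeholder server address
    path: /exports              # placeholder export path
```

Pods that were already running keep their existing mounts; the new NFS version takes effect the next time the volume is mounted, so affected pods need to be rescheduled.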

Cause

NFS v4.0 has a known limitation: when the NFS pod handles too many connections at once and the NFS container is deleted first, the kubelet can't unmount the NFS volume from the pod or the worker node. In addition, NFS mounts going stale when the server dies is a known issue, and many stale mounts building up on a worker node can slow down future NFS mounts. There can also be unmount issues related to this error.
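The first workaround above (force deletion followed by a kubelet restart) can be sketched as follows; the pod name is a placeholder, and restarting the kubelet requires SSH access to the affected worker node:

```shell
# Force-delete the stuck NFS workload without waiting for graceful termination
# (pod name is an example)
kubectl delete pod nfs-server-0 --grace-period=0 --force

# Then, on the affected worker node (via SSH), restart the kubelet
# so it re-evaluates the volume mounts
sudo systemctl restart kubelet
```

Force deletion removes the pod object immediately, so use it only on the stuck NFS workload; the kubelet restart clears the stale mount state that would otherwise block future NFS mounts on that node.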

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1