How does Kubernetes kubelet resource reservation work?

I recently tried to bring up a Kubernetes cluster in AWS using kops. When the worker node (Ubuntu 20.04) started, a docker load process on it kept getting OOM-killed even though the node had plenty of memory (~14GiB). I tracked the issue down to having set the kubelet's memory reservation too small (--kube-reserved=memory=100Mi...).

So now I have two questions related to the following paragraph in the documentation:

kube-reserved is meant to capture resource reservation for kubernetes system daemons like the kubelet, container runtime, node problem detector, etc.

https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#kube-reserved

First, I interpreted the "reservation" as "the amount of memory guaranteed", similar to the concept of a pod's .spec.containers[].resources.requests.memory. However, it seems the flag acts as a limit as well? Does this mean Kubernetes intends to manage its system daemons with something like the "Guaranteed" QoS class?

Also, my container runtime, docker, does not seem to be in the /kube-reserved cgroup; instead, it is in /system.slice:

$ systemctl status $(pgrep dockerd) | grep CGroup
     CGroup: /system.slice/docker.service

So why is it being limited by /kube-reserved? This is not even the kubelet talking to docker through the CRI; it is just my manual docker load command.



Solution 1:[1]

kube-reserved is a way to protect Kubernetes system daemons (which includes the Kubelet) from running out of memory should the pods consume too much. How is this achieved? The pods are limited by default to an "allocatable" value, equal to the memory capacity of the node minus several flag values defined in the URL you posted, one of which is kube-reserved. Here's what this looks like for a 7-GiB DS2_v2 node in AKS:

[Figure: Node Allocatable and node memory capacity distribution for a 7-GiB DS2_v2 AKS node]
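The subtraction described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the kubelet's actual code. The 1,638 MiB kube-reserved value is the one quoted in this answer; the 750 MiB hard eviction threshold, the zero system-reserved value, and the nominal 7 GiB capacity are assumptions made for the example (a real node reports somewhat less than its nominal size), and the function name is made up.

```python
def node_allocatable_mib(capacity_mib, kube_reserved_mib=0,
                         system_reserved_mib=0, eviction_hard_mib=0):
    """Return the memory (in MiB) left for pods after the kubelet's reservations."""
    allocatable = (capacity_mib - kube_reserved_mib
                   - system_reserved_mib - eviction_hard_mib)
    # Never report a negative value if the reservations exceed capacity.
    return max(allocatable, 0)


capacity = 7 * 1024  # nominal 7 GiB node, expressed in MiB (assumption)
allocatable = node_allocatable_mib(
    capacity,
    kube_reserved_mib=1638,   # value quoted in this answer
    eviction_hard_mib=750,    # assumed hard eviction threshold
)
print(allocatable)  # 4780
```

Under these assumptions, pods as a whole are capped at 4,780 MiB, and whatever remains of the node's memory is left to the system daemons and the OS.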

But it is not only the Kubernetes system daemons that may need protection from pods or OS components consuming too much memory. The Kubernetes system daemons themselves could consume too much memory and start affecting the pods or other OS components. To protect against this scenario, there's an additional flag:

To optionally enforce kube-reserved on kubernetes system daemons, specify the parent control group for kube daemons as the value for --kube-reserved-cgroup kubelet flag.

With this flag in place, if the aggregate memory use of the Kubernetes system daemons exceeds the cgroup limit, the OOM killer steps in and terminates one of their processes. Applied to the picture above, with the --kube-reserved-cgroup flag specified, the Kubernetes system daemons are prevented from going over 1,638 MiB.
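A cgroup limit only applies to processes that are members of that cgroup or of one of its descendants, so it can help to check where a given process actually lives. The following is a minimal sketch that parses the contents of /proc/<pid>/cgroup; the helper names are made up for illustration, and the sample text is hand-written to mirror the /system.slice/docker.service path shown in the question, not captured from a real node.

```python
def memory_cgroup_path(proc_cgroup_text):
    """Extract the memory cgroup path from the contents of /proc/<pid>/cgroup."""
    unified = None
    for line in proc_cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _, controllers, path = parts
        if "memory" in controllers.split(","):
            return path  # cgroup v1: a dedicated memory hierarchy
        if controllers == "":
            unified = path  # cgroup v2: the single unified hierarchy
    return unified


def is_limited_by(cgroup_path, parent):
    """True if cgroup_path is the parent cgroup itself or one of its descendants."""
    parent = parent.rstrip("/")
    return cgroup_path == parent or cgroup_path.startswith(parent + "/")


# Hand-written sample standing in for `cat /proc/$(pgrep dockerd)/cgroup`.
sample = (
    "12:memory:/system.slice/docker.service\n"
    "1:name=systemd:/system.slice/docker.service\n"
    "0::/system.slice/docker.service\n"
)
path = memory_cgroup_path(sample)
print(path, is_limited_by(path, "/kube-reserved"))
```

On a real node, the text would come from reading /proc/<pid>/cgroup for the process of interest. A result of False for /kube-reserved means that any memory limit on that cgroup is not the one constraining the process.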

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Mihai Albert