Intermittent 502 Bad Gateway
Backstory First:
We have a deployment that returns intermittent 502s when we load test it with something like JMeter. It's a container that logs POST data to a MySQL DB in another container. It handles around 85 requests per second pretty well, with minimal to no errors in JMeter; however, once this number starts increasing, the error rate starts to increase too. The errors come back as 502 Bad Gateway responses to JMeter:
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
Now the interesting - or rather, confusing - part is that this appears to be an NGINX error, yet we don't use NGINX for our ingress at all; everything goes through IBM Cloud (Bluemix).
We've deduced so far that the requests that come back with these 502s never actually reach our main.py script running in the container: there is no trace of them at the pod level (checked with `kubectl logs -n namespace deployment`). Is there any way to intercept/catch requests that basically don't make it into the pod, so that we can at least control what message a client gets back in case of these failures?
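For the last point - controlling what a client gets back when the request never reaches the pod - here is a minimal, hedged sketch, assuming the traffic really is terminated by the community ingress-nginx controller that the solution below points to (IBM Cloud's managed ALB may expect its own annotation set instead). The ingress name, namespace, and error-page service are hypothetical:

```bash
# Hypothetical names throughout: adjust the ingress name, namespace and
# error-page service to your cluster. Assumes the community ingress-nginx
# controller, which honours these annotations; a managed/custom controller
# may not.

# Intercept 502/503 responses on this ingress and send them to a custom
# default backend instead of returning nginx's built-in HTML page.
kubectl annotate ingress my-app-ingress -n my-namespace \
  nginx.ingress.kubernetes.io/custom-http-errors="502,503" \
  nginx.ingress.kubernetes.io/default-backend=error-pages

# "error-pages" is a small Service/Deployment you run yourself; it can return
# whatever body you want clients to see (e.g. a JSON error instead of HTML).
```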
Solution 1 [1]:
I assume the setup is Ingress --> Service --> Deployment. From https://cloud.ibm.com/docs/containers?topic=containers-ingress-types I conclude you are using the NGINX ingress controller, since there is no mention of a custom ingress controller or ingress class being used.
The 502s appear only above 85 req/sec, so the Ingress/Service/Deployment k8s resources seem to be configured correctly... there should be no need to re-check your service endpoints and ingress configuration.
Please see below some troubleshooting tips for intermittent 502 errors from the ingress controller:
- The Pods may not cope with the increased load (this might not apply to you since 85 req/sec is pretty low, and you said `kubectl get pods` shows 0 RESTARTS, but it may be useful to others). The checks below are collected in a command sketch after this list:
  - the pods hit memory/cpu limits if you have them configured; check for pod status OOMKilled, for example in `kubectl get pods`, and also do a `kubectl describe` on your pods/deployment/replicaset and check for any errors
  - the pods may not respond to the Liveness Probe and get restarted, and you will see 502s; do a `kubectl describe svc <your service> | grep Endpoints` and check if you have any backend pods Ready for your service
  - the pods may not respond to the Readiness Probe, in which case they will not be eligible as backend pods for your Service; again, when you start seeing the 502s, check if there are any Endpoints for the Service
- Missing readiness probe: your pod will be considered Ready and become available as an Endpoint for your Service even though the application has not started yet. But this would mean getting the 502s only at the beginning of your JMeter test... so I guess this does not apply to your use case (a hedged probe/resource-limits sketch also follows after this list).
- Are you scaling automatically? When the load increases, does another pod start, maybe without a readiness probe?
- Are you using Keep-Alive in JMeter? You may run out of file descriptors because you are creating too many connections; I don't see this resulting in 502s, but it is still worth checking...
- The ingress controller itself cannot handle the traffic (at 85 req/sec this is hard to imagine, but adding it for the sake of completeness):
  - if you have enough permissions, you can do a `kubectl get ns` and look for the namespace containing the ingress controller, `ingress-nginx` or something similar. Look for pod restarts or other events in that namespace.
- If none of the above points help, continue your investigation; try other things and look for clues:
  - Try to better isolate the issue: use `kubectl port-forward` instead of going through the ingress. Can you inject more than 85 req/sec? If yes, then your Pods can handle the load and you have isolated the issue to the ingress controller.
  - Try to start more replicas of your Pods.
  - Use the JMeter Throughput Shaping Timer plugin and increase the load gradually; then monitor what happens to your Service and Pods as the load increases; maybe you can find the exact trigger for the 502s and get more clues as to what the root cause could be.
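As referenced in the list above, here is a short command sketch collecting those checks in one place. The namespace, deployment/service names, and label selector are placeholders, and the ingress controller namespace may be named differently on IBM Cloud:

```bash
# Placeholders: adjust the namespace, names and label selector to your cluster.
NS=my-namespace

# 1. Restarts, OOMKilled status and other errors on the app pods
kubectl get pods -n "$NS" -o wide
kubectl describe deployment my-app -n "$NS"
kubectl describe pods -n "$NS" -l app=my-app | grep -iE 'oomkilled|restart|error'

# 2. While the 502s are happening: does the Service still have Ready endpoints?
kubectl describe svc my-app -n "$NS" | grep Endpoints

# 3. The ingress controller's own namespace: restarts and recent events
kubectl get ns | grep -i ingress
kubectl get pods -n ingress-nginx            # namespace name may differ
kubectl get events -n ingress-nginx --sort-by=.lastTimestamp

# 4. Bypass the ingress entirely: port-forward to the Service and re-run JMeter
kubectl port-forward svc/my-app 8080:80 -n "$NS"
# then point JMeter at http://localhost:8080 and check whether >85 req/sec
# still produces 502s (if not, the ingress layer is the likely culprit)
```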
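The memory/cpu-limit and probe points above only apply if the Deployment actually declares them. If it does not, this is a rough sketch of what adding them could look like; the container name, port, health-check path, and all of the numbers are assumptions to be tuned to what main.py really exposes:

```bash
# Hypothetical strategic-merge patch: container name, probe path/port and all
# resource numbers are assumptions, not values from the original question.
kubectl patch deployment my-app -n my-namespace --type=strategic --patch '
spec:
  template:
    spec:
      containers:
      - name: my-app
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
'
```

A subsequent `kubectl rollout status deployment/my-app -n my-namespace` shows whether the patched pods actually become Ready, which is also a quick way to confirm the probe path is right.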
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
