Problems routing to GKE when using an external proxy to the internet
I seem to be having a problem with the routes to and from a site beyond a corporate egress proxy when requesting from a pod in a fairly vanilla GKE installation. I have tried this on GKE clusters both with and without an Istio mesh.
Here are my observations.
From a VM on the same GCP subnet where the GKE instance is, I can run the following and get the expected response:
```shell
$ https_proxy=$PROXY_DNS:$PROXY_PORT curl https://en.wikipedia.org/wiki/Main_Page | grep -o "<title>.*</title>"
<title>Wikipedia, the free encyclopedia</title>
```
If I ssh to one of the nodes of the GKE instance, I can run the same request and get the expected response as well.
However, if I get a shell on a pod within the GKE instance (with the istio mesh in this case) and run the following curl with verbose output, I see:
```shell
$ HTTPS_PROXY=$PROXY_DNS:$PROXY_PORT curl -v https://en.wikipedia.org/wiki/Main_Page | grep -o "<title>.*</title>"
* Uses proxy env variable HTTPS_PROXY == '<PROXY_DNS value>:<PROXY_PORT value>'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Trying <IP of PROXY_DNS>:<PROXY_PORT value>...
* Connected to <PROXY_DNS value> (<IP of PROXY_DNS>) port <PROXY_PORT value> (#0)
* allocate connect buffer!
* Establish HTTP proxy tunnel to en.wikipedia.org:443
> CONNECT en.wikipedia.org:443 HTTP/1.1
> Host: en.wikipedia.org:443
> User-Agent: curl/7.82.0-DEV
> Proxy-Connection: Keep-Alive
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:09 --:--:--     0* Recv failure: Connection reset by peer
* Received HTTP code 0 from proxy after CONNECT
* CONNECT phase completed!
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
```
In this case, it appears that the connection to the proxy is established and the CONNECT request for wikipedia is sent, but no response ever comes back; the connection is reset by the peer.
If I get a shell on a pod within the GKE instance (without the istio mesh in this case) and run the same curl, I see the following:
```shell
# HTTPS_PROXY=$PROXY_DNS:$PROXY_PORT curl -v https://en.wikipedia.org/wiki/Main_Page | grep -o "<title>.*</title>"
* Uses proxy env variable HTTPS_PROXY == '<PROXY_DNS value>:<PROXY_PORT value>'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Trying <IP of PROXY_DNS>...
* TCP_NODELAY set
  0     0    0     0    0     0      0      0 --:--:--  0:02:10 --:--:--     0* connect to <IP of PROXY_DNS> port <PROXY_PORT value> failed: Connection timed out
* Failed to connect to <PROXY_DNS value> port <PROXY_PORT value>: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to <PROXY_DNS value> port <PROXY_PORT value>: Connection timed out
```
In this case, it appears that curl cannot reach the proxy at all.
Note: running `curl -v https://en.wikipedia.org/wiki/Main_Page | grep -o "<title>.*</title>"` without the proxy environment variables set succeeds from the pods in both GKE clusters.
So it seems that in the cluster with the Istio mesh I have a routing problem on the return path from the proxy, while in the cluster without the Istio mesh I am not even able to route to the proxy. Since I can successfully use the proxy from the nodes of either cluster, and from a vanilla VM on the same GCP subnet, the problem must lie somewhere in the hop between the pods and the nodes they run on, but I am at a loss as to what is wrong.
I am looking for solutions to the routing issue from the GKE cluster through to the proxy as I suspect it is just some problem in the setup of my cluster.
tl;dr
Here are the details of the setup. I have established a VPN from a project in GCP to AWS and set up a VPC proxy in AWS by following a combination of these two posts:
- How to set up an outbound VPC proxy with domain whitelisting and content filtering
- Build HA VPN connections between Google Cloud and AWS
The setup works well, as shown by the successful use of the proxy from the vanilla VM on the GCP subnet and from the nodes of the GKE instances. I take from these results that the routing from the subnet down the VPN and through the (Squid) proxies is working.
I first set up the vanilla GKE instance and tried following a few posts to get this to work including:
- GKE docker registry with HTTP proxy
- How to set proxy settings (http_proxy variables) for kubernetes (v1.11.2) cluster?
After a fair amount of work with those posts, I still got the results above from that cluster, i.e. I was not able to reach the proxy.
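For reference, this is roughly how the proxy variables reach a workload in those posts: they are set as environment variables on the container. A minimal sketch of a test Deployment (the name, image, and proxy address below are placeholders, not values from my setup):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: proxy-client              # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: proxy-client
  template:
    metadata:
      labels:
        app: proxy-client
    spec:
      containers:
      - name: curl
        image: curlimages/curl:latest        # placeholder image
        command: ["sleep", "infinity"]
        env:
        - name: HTTPS_PROXY
          value: "http://proxy.example.internal:3128"   # placeholder proxy DNS/port
        - name: NO_PROXY
          value: "localhost,127.0.0.1,.svc,.cluster.local"
```

The `NO_PROXY` entries keep in-cluster traffic (localhost and cluster-local service names) from being sent to the external proxy.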
Then I tried setting up another GKE cluster with an Istio mesh, following another post:
That post demonstrates the use of a Squid proxy in a different namespace but on the same Kubernetes (GKE in my case) cluster, routing to it 'externally'. I started with this and was able to use the proxy in the external namespace from the default namespace as described in the post. Once I had demonstrated that, I tried to use the proxy set up in AWS on the other side of the VPN. The requests got further, as recorded above, but the return path, seemingly from the node back to the pod, did not make it.
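In the Istio case, one thing worth checking (this is my own addition, not something from the posts above, and the host/port values are placeholders) is whether the mesh knows about the external proxy at all. Registering the proxy with a `ServiceEntry` tells the sidecars to pass raw TCP traffic to it rather than treating it as unknown egress:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: corporate-egress-proxy   # placeholder name
spec:
  hosts:
  - proxy.example.internal       # placeholder: the proxy's DNS name
  ports:
  - number: 3128                 # placeholder: the proxy's port
    name: tcp-proxy
    protocol: TCP
  resolution: DNS
  location: MESH_EXTERNAL
```

`protocol: TCP` matters here: the CONNECT tunnel carries TLS, so the sidecar should not try to parse it as plain HTTP.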
Solution 1
We have found a solution to the above problem. The GKE cluster(s) we had set up did not, by default, IP-masquerade the pod IP addresses when traffic left the node for the wider network. Information on the IP masquerade agent in GKE may be found here, and the configuration of the agent is described here. (Note: this is for Standard clusters; Autopilot clusters have their own configuration page, which we were not following, so YMMV.)
We were able to use the given daemonset example without changes, and in the configmap we set nonMasqueradeCIDRs to an array containing the pod range and the range of the cluster's subnet.
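A sketch of the shape of that ConfigMap (the CIDR values below are placeholders; substitute your cluster's pod range and subnet range). The agent looks for it in `kube-system` under the name `ip-masq-agent`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
    - 10.4.0.0/14     # placeholder: the cluster's pod IP range
    - 10.128.0.0/20   # placeholder: the cluster subnet's primary range
    resyncInterval: 60s
```

With this applied, traffic to any destination outside the listed CIDRs (such as the proxy across the VPN) is SNAT'd to the node's IP, which the return route already knows how to reach; the daemonset picks up changes on its next resync.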
Once we applied the config map and the daemonset, workloads on the pods could use the proxy settings and be routed through the proxy to the target site and back successfully. In addition, when the proxy settings were not used, requests would route out through the GCP internet gateway, which in our case is available from the subnetwork on which the GKE cluster sits.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Kevin O'Connor |
