Setting up API Gateway throttling for multi-second queries

I want to set up API Gateway throttling correctly for our system; it doesn't seem to be working as intended at the moment.

Here are some constraints we're up against:

  1. Average response latency: 10 seconds.
  2. Total queries a client needs to send per month: 1M, which works out to handling roughly 22 queries a minute.

Our goal is to provision just enough K8s pods to handle this traffic (more pods means more cost for us). Because of the high per-request latency, we can only handle so many requests concurrently, so we need to throttle what the client sends and return 429s to tell them to retry.

Putting this all together, let's assume I provision 4 pods. In an ideal world that means I can handle 24 requests a minute (4 requests in parallel assuming 1 request per pod, times 6 sequential requests per minute since each request takes 10 s, so 4 × 6 = 24). That gives the client their 1M queries for the month.
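As a sanity check on that arithmetic (assuming a 30-day month):

```python
# Back-of-the-envelope check of the capacity math above.
pods = 4
latency_s = 10                      # average response latency per request
per_pod_per_min = 60 // latency_s   # 6 sequential requests per pod per minute
rpm = pods * per_pod_per_min        # 4 x 6 = 24 requests/minute total
monthly = rpm * 60 * 24 * 30        # minutes in a 30-day month
print(rpm, monthly)                 # 24 requests/minute, 1,036,800 per month
```

So 4 pods gives a small margin over the 1M-per-month target.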

Questions:

  1. Please check my understanding of burst: if I set the burst to 4 (assuming number of pods = number of parallel requests), the maximum number of in-flight requests API Gateway will let through is 4, and when one of those requests completes, it makes room for the client to retry with a new request. Do I have that right?
  2. When I set a rate of 0.1 RPS and a burst of 4 (i.e., refill the token bucket every 10 seconds and never allow more than 4 requests at a time) and hit it with a lot of parallel traffic (say 30 requests per second), API Gateway doesn't return any 429s, and my service sees all 30 requests within the second, so the throttling doesn't seem to work. What am I doing wrong? I created a usage plan, attached an API key to it, set the usage plan's throttling rate and burst accordingly, and added the API + stage as well.
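To make the rate/burst semantics in both questions concrete, here is a minimal token-bucket sketch of the behavior I expected. This is my own model, not AWS code (names like `TokenBucket` and `allow` are mine), and API Gateway's real throttling is distributed and best-effort, so actual counts can differ:

```python
# Hypothetical token-bucket model of a rate/burst throttle.
# rate = 0.1 tokens/sec (one new request admitted every 10 s),
# burst = 4 (bucket capacity: at most 4 requests admitted back-to-back).

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens added per second
        self.capacity = burst       # maximum stored tokens
        self.tokens = float(burst)  # bucket starts full
        self.last = 0.0             # timestamp of last refill

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True             # request admitted
        return False                # request throttled -> 429

bucket = TokenBucket(rate=0.1, burst=4)
# 30 requests arriving spread across one second:
results = [bucket.allow(now=i / 30) for i in range(30)]
print(results.count(True), results.count(False))  # 4 admitted, 26 throttled
```

Under this idealized model only 4 of the 30 requests are admitted and the other 26 would get 429s, which is what I expected to see.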


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
