How to handle a single large request spike to a SageMaker endpoint

I know SageMaker endpoints have autoscaling as an option, but from my understanding that mainly applies when there is a sustained high request volume. Our issue is that on occasion there is a huge, sudden, single spike, after which traffic goes back to normal. Is autoscaling fast enough (what's the delay?) to handle that? Or does it need to actually spin up another instance? Would having two instances at the endpoint help it respond immediately to an isolated request spike? I'm just not clear on what the response delay is for autoscaling, and I do not see this mentioned in their posts/documentation. Thanks



Solution 1:[1]

On your question about how long SageMaker needs to start a new instance: the time varies depending on the model size, how long it takes to download the model, and the start-up time of the container.

As for the options you have:

  1. Scale on hardware utilization: for example, scale out when CPU utilization reaches 76%.
  2. Use step scaling: for example, scale out based on the OverheadLatency metric.
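The first option could be sketched roughly as below with boto3 and Application Auto Scaling. This is a minimal sketch, not a tested production setup: the helper names, the policy name, the cooldown values, and the minimum of two instances are all assumptions made for illustration, and the endpoint and variant names are placeholders.

```python
def resource_id(endpoint_name, variant_name):
    """Resource ID format Application Auto Scaling uses for a SageMaker variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def build_cpu_target_tracking_config(endpoint_name, variant_name,
                                     target_cpu_percent=76.0,
                                     scale_out_cooldown=60,
                                     scale_in_cooldown=300):
    """Build a target-tracking configuration on the endpoint's CPUUtilization metric.

    The cooldown defaults (in seconds) are illustrative assumptions.
    """
    return {
        "TargetValue": float(target_cpu_percent),
        "CustomizedMetricSpecification": {
            "MetricName": "CPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant_name},
            ],
            "Statistic": "Average",
            "Unit": "Percent",
        },
        "ScaleOutCooldown": scale_out_cooldown,
        "ScaleInCooldown": scale_in_cooldown,
    }


def apply_cpu_scaling(endpoint_name, variant_name, min_instances=2, max_instances=4):
    """Register the variant as a scalable target and attach the policy.

    Needs boto3 and AWS credentials; the instance limits are assumptions.
    """
    import boto3  # imported lazily so the builders above work without it

    client = boto3.client("application-autoscaling")
    rid = resource_id(endpoint_name, variant_name)
    dimension = "sagemaker:variant:DesiredInstanceCount"

    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=rid,
        ScalableDimension=dimension,
        MinCapacity=min_instances,
        MaxCapacity=max_instances,
    )
    client.put_scaling_policy(
        PolicyName="cpu-target-tracking",  # hypothetical policy name
        ServiceNamespace="sagemaker",
        ResourceId=rid,
        ScalableDimension=dimension,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=build_cpu_target_tracking_config(
            endpoint_name, variant_name
        ),
    )
```

Calling `apply_cpu_scaling("my-endpoint", "AllTraffic")` would register the variant and attach the policy. Keeping `min_instances` at 2 is one way to have spare capacity already running when an isolated spike arrives, since a newly added instance is only useful after it has finished starting.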

Here is a good resource for the above: https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/. You can always do load testing before you choose the strategy that best fits your needs; see this URL for load testing: https://aws.amazon.com/blogs/machine-learning/load-test-and-optimize-an-amazon-sagemaker-endpoint-using-automatic-scaling/

Also, I think Amazon SageMaker Serverless Inference is a good solution for your case, but it is in the preview stage right now: https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-serverless-inference/
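If you want to try the serverless route, an endpoint configuration could look roughly like this with boto3. This is only a sketch: the helper names, the variant name, and the memory and concurrency values are assumptions for illustration, and the model name is a placeholder for a model you have already created.

```python
def build_serverless_variant(model_name, memory_mb=2048, max_concurrency=5):
    """Build a production variant that uses ServerlessConfig instead of instances.

    memory_mb and max_concurrency are illustrative values, not recommendations.
    """
    return {
        "VariantName": "AllTraffic",  # assumed variant name
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


def create_serverless_endpoint_config(config_name, model_name):
    """Create the endpoint configuration; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    sagemaker = boto3.client("sagemaker")
    return sagemaker.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[build_serverless_variant(model_name)],
    )
```

It is worth load testing this setup against a simulated spike before relying on it.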

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1