How to achieve auto scaling with AWS SageMaker with a single Docker image and multiple model artifacts?
I am exploring AWS SageMaker for deploying machine learning models with a single container/Docker image, and I want to achieve scalability as well.
For context, we have written the training and prediction code so that the same code is used for different customers. Only the ENV variables and VERSION_NUMBER differ per customer.
VERSION_NUMBER: refers to the S3 folder name (the current timestamp) that contains the latest model artifact.
For example:
Customer name: XYZ
docker image (training and prediction): **intent-mapping:latest**
S3 model artifact path: s3_bucket/XYZ/VERSION_NUMBER/XYZ.tar.gz
ENV variables: CUSTOMER_SUFFIX, IS_INCREMENTAL
Customer name: PQR
docker image (training and prediction): **intent-mapping:latest**
S3 model artifact path: s3_bucket/PQR/VERSION_NUMBER/PQR.tar.gz
ENV variables: CUSTOMER_SUFFIX, IS_INCREMENTAL
If you read it carefully, you will see that the same Docker image is referenced everywhere.
So, my questions are as follows:
- Which approach should I follow to deploy the models in such a scenario?
- If we use the multi-model or multi-container option of SageMaker, how will auto scaling be managed for each customer independently?
Please let me know your thoughts on this.
Solution 1:[1]
There are two ways to approach the problem:
- MME (multi-model endpoint)
- MCE (multi-container endpoint)
More information on the advantages and use cases can be found here.
In my opinion, MME is apt for your use case, since the framework and container (and in fact the model code) remain the same across customers.
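To make this concrete, here is a rough sketch of what an MME deployment could look like with boto3. The image URI, role ARN, endpoint/config names, instance type and request payload are placeholders, and I'm reusing the S3 layout from your question purely for illustration. One caveat to keep in mind: an MME has a single container definition, so environment variables such as CUSTOMER_SUFFIX are shared by all models on the endpoint rather than set per customer.

```python
import boto3

sm = boto3.client("sagemaker")
smrt = boto3.client("sagemaker-runtime")

# Placeholder values -- replace with your own account/region/role details.
IMAGE_URI = "<account>.dkr.ecr.<region>.amazonaws.com/intent-mapping:latest"
MODEL_DATA_PREFIX = "s3://s3_bucket/"  # common prefix holding XYZ/..., PQR/...
ROLE_ARN = "arn:aws:iam::<account>:role/SageMakerExecutionRole"

# 1. One model entity pointing at the shared container and the S3 prefix.
sm.create_model(
    ModelName="intent-mapping-mme",
    ExecutionRoleArn=ROLE_ARN,
    PrimaryContainer={
        "Image": IMAGE_URI,
        "Mode": "MultiModel",               # marks this as a multi-model container
        "ModelDataUrl": MODEL_DATA_PREFIX,  # all customer artifacts live under this prefix
    },
)

# 2. One endpoint config and one endpoint shared by every customer.
sm.create_endpoint_config(
    EndpointConfigName="intent-mapping-mme-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "intent-mapping-mme",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="intent-mapping-mme",
    EndpointConfigName="intent-mapping-mme-config",
)
sm.get_waiter("endpoint_in_service").wait(EndpointName="intent-mapping-mme")

# 3. At inference time, select the customer's artifact via TargetModel,
#    given as a path relative to ModelDataUrl.
response = smrt.invoke_endpoint(
    EndpointName="intent-mapping-mme",
    TargetModel="XYZ/VERSION_NUMBER/XYZ.tar.gz",  # PQR/... for the other customer
    ContentType="application/json",
    Body=b'{"text": "sample input"}',
)
print(response["Body"].read())
```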
As far as the sharing of resources is concerned - Multi-model endpoints enable time-sharing of memory resources across your models. This works best when the models are fairly similar in size and invocation latency. When this is the case, multi-model endpoints can effectively use instances across all models. If you have models that have significantly higher transactions per second (TPS) or latency requirements, we recommend hosting them on dedicated endpoints. Multi-model endpoints are also well suited to scenarios that can tolerate occasional cold-start-related latency penalties that occur when invoking infrequently used models.
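On the auto scaling question specifically: with an MME, the scaling policy attaches to the endpoint's production variant, i.e. to the shared instance fleet, not to individual customer models. Customers that need independent scaling are the "dedicated endpoint" case described above. Below is a rough sketch using Application Auto Scaling; the endpoint and variant names match the sketch above, and the capacities and target value are placeholders you would tune for your traffic.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Scaling operates on the production variant of the endpoint, i.e. the fleet
# serving all customers' models -- not on individual models.
resource_id = "endpoint/intent-mapping-mme/variant/AllTraffic"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

aas.put_scaling_policy(
    PolicyName="intent-mapping-mme-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Target invocations per instance per minute across the whole endpoint.
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "TargetValue": 200.0,
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```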
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Raghu Ramesha |
