How to make TFLite handle multiple API calls simultaneously?
I am working on cloud cost optimization on AWS and am currently exploring deployment of TFLite models on AWS t4g instances, which use AWS's custom ARM chip, Graviton2. They offer roughly 2x the performance at about half the cost compared to t3 instances. I have created a basic deployment container for InceptionNet-v3 pretrained on ImageNet (no fine-tuning) using FastAPI.
The InceptionNet-v3 TFLite model is quantized to float16. The deployment code is shared below.
```python
import tflite_runtime.interpreter as tflite
import numpy as np
from PIL import Image
import copy


def load_model(model_path):
    '''
    Used to load the model as a global variable
    '''
    model_interpreter = tflite.Interpreter(model_path)
    return model_interpreter


def infer(image, model_interpreter):
    '''
    Runs inference on the input image
    '''
    model_interpreter.allocate_tensors()
    input_details = model_interpreter.get_input_details()[0]
    output_details = model_interpreter.get_output_details()[0]

    # tensor() returns callables that expose views into the interpreter's
    # internal buffers for the input and output tensors.
    input = model_interpreter.tensor(input_details["index"])
    output = model_interpreter.tensor(output_details["index"])

    # Resize to the model's expected spatial size and scale to [0, 1].
    image = image.resize(input_details["shape"][1:-1])
    image = np.asarray(image, dtype=np.float32)
    image = np.expand_dims(image, 0)
    image = image / 255

    input()[:] = image
    model_interpreter.invoke()
    results = copy.deepcopy(output())
    return results
```
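For context, the FastAPI side looks roughly like this; the endpoint path, model filename, and response format are placeholders rather than my exact code, but the key point is that the interpreter is created once at startup and shared by all requests:

```python
import io

from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()

# Created once at startup; every request reuses this same interpreter.
model_interpreter = load_model("inception_v3_float16.tflite")


@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    results = infer(image, model_interpreter)
    return {"scores": results.tolist()}
```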
When I run my load-testing program (written with Locust, a Python load-testing package), I don't see any errors until the user count goes above 3. After that I start seeing the runtime error below:
```
RuntimeError: There is at least 1 reference to internal data
in the interpreter in the form of a numpy array or slice. Be sure to
only hold the function returned from tensor() if you are using raw
data access.
```
I don't see the error for every request; it is random in nature.
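For reference, the Locust test is roughly the following (the endpoint path and sample image are placeholders, not the exact script I run); each simulated user just keeps POSTing an image to the service:

```python
from locust import HttpUser, task, between


class InferenceUser(HttpUser):
    wait_time = between(0.5, 1.0)

    @task
    def predict(self):
        # POST a sample image to the inference endpoint.
        with open("sample.jpg", "rb") as f:
            self.client.post(
                "/predict",
                files={"file": ("sample.jpg", f, "image/jpeg")},
            )
```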
When I moved the model-loading lines inside the infer function, I stopped seeing this error. But that loads a new model on every call, which is fine while the model is small; as the model grows, though, it adds a significant amount to the inference time.
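To be concrete, the only version that stays stable under load is something like the sketch below, where a fresh interpreter is built inside the function on every call (the model path is a placeholder):

```python
def infer_per_request(image, model_path="inception_v3_float16.tflite"):
    # Workaround: build a brand-new interpreter for every call. Nothing is
    # shared between requests, so the RuntimeError goes away, but the model
    # file is re-parsed and tensors are re-allocated on each request.
    model_interpreter = tflite.Interpreter(model_path)
    model_interpreter.allocate_tensors()

    input_details = model_interpreter.get_input_details()[0]
    output_details = model_interpreter.get_output_details()[0]

    image = image.resize(tuple(input_details["shape"][1:-1]))
    image = np.expand_dims(np.asarray(image, dtype=np.float32), 0) / 255

    model_interpreter.set_tensor(input_details["index"], image)
    model_interpreter.invoke()
    return model_interpreter.get_tensor(output_details["index"])
```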
Is there a way to solve this random error while having the TFLite model as a global variable?
Thanks.
EDIT:
I tried out the signature runner API. I didn't get the error, but for some reason my Docker container kept exiting without any error message. I think it was because of a lack of memory.
Does the signature runner create multiple interpreter instances in different threads while the current interpreter is running, thus using up all the available RAM?
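For reference, the signature-runner version I tried looks roughly like this; it assumes the model was converted with a 'serving_default' signature whose input is named 'input', so both names are placeholders that depend on how the model was exported:

```python
# Created once at startup, as before.
model_interpreter = tflite.Interpreter("inception_v3_float16.tflite")
runner = model_interpreter.get_signature_runner("serving_default")


def infer_with_signature(image):
    # InceptionNet-v3 expects 299x299 RGB inputs.
    image = image.resize((299, 299))
    image = np.expand_dims(np.asarray(image, dtype=np.float32), 0) / 255

    # The runner copies data in and out itself, so no tensor() views into
    # the interpreter's internal buffers are held across calls.
    outputs = runner(input=image)
    return next(iter(outputs.values()))
```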
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow