How to make TFLite handle multiple API calls simultaneously?
I am working on cloud cost optimization on AWS and am currently exploring deployment of TFLite models on AWS t4g instances, which use AWS's custom ARM chip, Graviton2. They offer roughly 2x the performance at about half the cost compared to t3 instances. I have created a basic deployment container for InceptionNet-v3 pretrained on ImageNet (no fine-tuning) using FastAPI.
The InceptionNet-v3 TFLite model is quantized to float16. The deployment code is shared below.
```python
import tflite_runtime.interpreter as tflite
import numpy as np
from PIL import Image
import copy


def load_model(model_path):
    '''
    Used to load the model as a global variable
    '''
    model_interpreter = tflite.Interpreter(model_path)
    return model_interpreter


def infer(image, model_interpreter):
    '''
    Runs inference on the input image
    '''
    model_interpreter.allocate_tensors()
    input_details = model_interpreter.get_input_details()[0]
    output_details = model_interpreter.get_output_details()[0]

    # tensor() returns callables that expose views into the interpreter's
    # internal buffers for the input and output tensors.
    input = model_interpreter.tensor(input_details["index"])
    output = model_interpreter.tensor(output_details["index"])

    # Resize to the model's expected spatial size and scale to [0, 1].
    image = image.resize(input_details["shape"][1:-1])
    image = np.asarray(image, dtype=np.float32)
    image = np.expand_dims(image, 0)
    image = image / 255

    input()[:] = image
    model_interpreter.invoke()
    results = copy.deepcopy(output())
    return results
```
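For context, the FastAPI side looks roughly like this; the endpoint path, model filename, and response format are placeholders rather than my exact code, but the key point is that the interpreter is created once at startup and shared by all requests:

```python
import io

from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()

# Created once at startup; every request reuses this same interpreter.
model_interpreter = load_model("inception_v3_float16.tflite")


@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    results = infer(image, model_interpreter)
    return {"scores": results.tolist()}
```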
When I run my load-testing program (written with Locust, a Python load-testing package), I don't see any errors until the user count goes above 3. After that I start seeing the runtime error below:
```
RuntimeError: There is at least 1 reference to internal data
in the interpreter in the form of a numpy array or slice. Be sure to
only hold the function returned from tensor() if you are using raw
data access.
```
I don't see the error for every request; it is random in nature.
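For reference, the Locust test is roughly the following (the endpoint path and sample image are placeholders, not the exact script I run); each simulated user just keeps POSTing an image to the service:

```python
from locust import HttpUser, task, between


class InferenceUser(HttpUser):
    wait_time = between(0.5, 1.0)

    @task
    def predict(self):
        # POST a sample image to the inference endpoint.
        with open("sample.jpg", "rb") as f:
            self.client.post(
                "/predict",
                files={"file": ("sample.jpg", f, "image/jpeg")},
            )
```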
When I moved the model-loading lines inside the infer function, I stopped seeing this error. But that loads a new model on every call, which is fine while the model is small; as the model grows, though, it adds a significant amount to the inference time.
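To be concrete, the only version that stays stable under load is something like the sketch below, where a fresh interpreter is built inside the function on every call (the model path is a placeholder):

```python
def infer_per_request(image, model_path="inception_v3_float16.tflite"):
    # Workaround: build a brand-new interpreter for every call. Nothing is
    # shared between requests, so the RuntimeError goes away, but the model
    # file is re-parsed and tensors are re-allocated on each request.
    model_interpreter = tflite.Interpreter(model_path)
    model_interpreter.allocate_tensors()

    input_details = model_interpreter.get_input_details()[0]
    output_details = model_interpreter.get_output_details()[0]

    image = image.resize(tuple(input_details["shape"][1:-1]))
    image = np.expand_dims(np.asarray(image, dtype=np.float32), 0) / 255

    model_interpreter.set_tensor(input_details["index"], image)
    model_interpreter.invoke()
    return model_interpreter.get_tensor(output_details["index"])
```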
Is there a way to solve this random error while having the TFLite model as a global variable?
Thanks.
EDIT:
I tried out the signature runner API. I didn't get the error, but for some reason my Docker container kept exiting without any error message. I think it was because of a lack of memory.
Does the signature runner create multiple interpreter instances in different threads while the current interpreter is running, thus using up all the available RAM?
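For reference, the signature-runner version I tried looks roughly like this; it assumes the model was converted with a 'serving_default' signature whose input is named 'input', so both names are placeholders that depend on how the model was exported:

```python
# Created once at startup, as before.
model_interpreter = tflite.Interpreter("inception_v3_float16.tflite")
runner = model_interpreter.get_signature_runner("serving_default")


def infer_with_signature(image):
    # InceptionNet-v3 expects 299x299 RGB inputs.
    image = image.resize((299, 299))
    image = np.expand_dims(np.asarray(image, dtype=np.float32), 0) / 255

    # The runner copies data in and out itself, so no tensor() views into
    # the interpreter's internal buffers are held across calls.
    outputs = runner(input=image)
    return next(iter(outputs.values()))
```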
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow