How to use ONNX Runtime in parallel with Flask?

I created a server and want to run a single ONNX Runtime session in parallel.

First question: should I use multi-threading or multi-processing?

I tried multi-threading with app.run(host='127.0.0.1', port='12345', threaded=True).
With 3 concurrent threads the GPU memory stays under 8 GB and the program runs. With 4 threads the GPU memory exceeds 8 GB and the program fails with: onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED.

I understand the problem is GPU memory exhaustion, but I would like the program not to crash. I tried to limit the number of threads by setting intra_op_num_threads = 2, inter_op_num_threads = 2, and os.environ["OMP_NUM_THREADS"] = "2", but none of these worked. I also tried 'gpu_mem_limit', which did not work either.
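For reference, gpu_mem_limit is an option of the CUDA execution provider, not of SessionOptions, so it has to be passed inside the providers list; intra_op_num_threads, inter_op_num_threads, and OMP_NUM_THREADS only control CPU threading, which would explain why they had no effect on GPU memory. A minimal sketch assuming ONNX Runtime 1.8's provider-options tuples (the 6 GB cap is illustrative):

```python
# import onnxruntime as rt  # requires the onnxruntime-gpu package

# Options for the CUDA execution provider; gpu_mem_limit caps the
# CUDA memory arena in bytes.
cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 6 * 1024 * 1024 * 1024,  # 6 GB cap (illustrative)
    "arena_extend_strategy": "kSameAsRequested",
}

# Provider options are attached as a (name, options) tuple, with a
# CPU fallback after it.
providers = [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]

# sess = rt.InferenceSession(model_XXX, providers=providers)
```

Note that the limit applies to the arena allocator, so inference that genuinely needs more memory than the cap will still fail, but with a Python exception rather than uncontrolled growth across requests.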

import onnxruntime as rt
from flask import Flask, request, jsonify
app = Flask(__name__)

# Created once at import time; every request thread reuses this session.
sess = rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider'])

@app.route('/algorithm', methods=['POST'])
def parser():
    prediction = sess.run(...)
    return jsonify(prediction)  # a Flask view must return a response

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=12345, threaded=True)
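Independently of ONNX Runtime settings, the number of requests that touch the GPU at once can be capped at the application level, so a fourth request waits its turn instead of triggering CUBLAS_STATUS_ALLOC_FAILED. A minimal sketch using a semaphore with a cap of 3, the highest count that fit in 8 GB above (the run_inference helper is illustrative, not part of ONNX Runtime or Flask):

```python
import threading

MAX_CONCURRENT = 3  # highest thread count that stayed under 8 GB
_slots = threading.Semaphore(MAX_CONCURRENT)

def run_inference(run_fn, *args):
    """Call run_fn (e.g. sess.run) while holding one of the slots.

    Callers beyond MAX_CONCURRENT block here until a slot frees up,
    instead of launching a GPU inference that cannot be satisfied.
    """
    with _slots:
        return run_fn(*args)
```

Inside the Flask view this would be used as prediction = run_inference(sess.run, ...), leaving Flask's own threading untouched.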

My understanding is that the Flask HTTP server may use a different sess for each call. How can I make every call use the same ONNX Runtime session?
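On that last point: a module-level sess is created exactly once when Flask imports the module, and with threaded=True every request thread already calls that same session object; ONNX Runtime documents InferenceSession.run as safe to call from multiple threads. If the calls should nevertheless be serialized, a lock around the call is enough. A sketch with a hypothetical predict wrapper (not part of either library):

```python
import threading

_session_lock = threading.Lock()

def predict(session, output_names, input_feed):
    """Run one inference on the shared session.

    Every Flask worker thread passes in the same module-level session;
    the lock ensures at most one sess.run executes at a time.
    """
    with _session_lock:
        return session.run(output_names, input_feed)
```

Serializing the calls this way trades throughput for a bounded GPU footprint: memory use matches single-threaded operation regardless of how many HTTP requests arrive.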

System information

  • OS Platform and Distribution: Windows 10
  • ONNX Runtime version: 1.8
  • Python version: 3.7
  • GPU model and memory: RTX 3070, 8 GB


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow