How to use ONNX Runtime in parallel with Flask?

I created a server and want to run a single ONNX Runtime session in parallel.

First question: should I use multi-threading or multi-processing?

I tried multi-threading with app.run(host='127.0.0.1', port='12345', threaded=True).
With 3 concurrent threads the GPU memory stays under 8 GB and the program runs. With 4 threads the GPU memory exceeds 8 GB and the program fails with: onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED.

I understand the problem is GPU memory exhaustion, but I would like the program not to crash. I tried to limit the number of threads by setting intra_op_num_threads = 2, inter_op_num_threads = 2, and os.environ["OMP_NUM_THREADS"] = "2", but none of these worked. I also tried 'gpu_mem_limit', which did not work either.
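For reference, gpu_mem_limit is an option of the CUDA execution provider, not of SessionOptions, so it has to be passed inside the providers list; intra_op_num_threads, inter_op_num_threads, and OMP_NUM_THREADS only control CPU threading, which would explain why they had no effect on GPU memory. A minimal sketch assuming ONNX Runtime 1.8's provider-options tuples (the 6 GB cap is illustrative):

```python
# import onnxruntime as rt  # requires the onnxruntime-gpu package

# Options for the CUDA execution provider; gpu_mem_limit caps the
# CUDA memory arena in bytes.
cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 6 * 1024 * 1024 * 1024,  # 6 GB cap (illustrative)
    "arena_extend_strategy": "kSameAsRequested",
}

# Provider options are attached as a (name, options) tuple, with a
# CPU fallback after it.
providers = [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]

# sess = rt.InferenceSession(model_XXX, providers=providers)
```

Note that the limit applies to the arena allocator, so inference that genuinely needs more memory than the cap will still fail, but with a Python exception rather than uncontrolled growth across requests.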

import onnxruntime as rt
from flask import Flask, request, jsonify
app = Flask(__name__)

# Created once at import time; every request thread reuses this session.
sess = rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider'])

@app.route('/algorithm', methods=['POST'])
def parser():
    prediction = sess.run(...)
    return jsonify(prediction)  # a Flask view must return a response

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=12345, threaded=True)
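Independently of ONNX Runtime settings, the number of requests that touch the GPU at once can be capped at the application level, so a fourth request waits its turn instead of triggering CUBLAS_STATUS_ALLOC_FAILED. A minimal sketch using a semaphore with a cap of 3, the highest count that fit in 8 GB above (the run_inference helper is illustrative, not part of ONNX Runtime or Flask):

```python
import threading

MAX_CONCURRENT = 3  # highest thread count that stayed under 8 GB
_slots = threading.Semaphore(MAX_CONCURRENT)

def run_inference(run_fn, *args):
    """Call run_fn (e.g. sess.run) while holding one of the slots.

    Callers beyond MAX_CONCURRENT block here until a slot frees up,
    instead of launching a GPU inference that cannot be satisfied.
    """
    with _slots:
        return run_fn(*args)
```

Inside the Flask view this would be used as prediction = run_inference(sess.run, ...), leaving Flask's own threading untouched.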

My understanding is that the Flask HTTP server may use a different sess for each call. How can I make every call use the same ONNX Runtime session?
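On that last point: a module-level sess is created exactly once when Flask imports the module, and with threaded=True every request thread already calls that same session object; ONNX Runtime documents InferenceSession.run as safe to call from multiple threads. If the calls should nevertheless be serialized, a lock around the call is enough. A sketch with a hypothetical predict wrapper (not part of either library):

```python
import threading

_session_lock = threading.Lock()

def predict(session, output_names, input_feed):
    """Run one inference on the shared session.

    Every Flask worker thread passes in the same module-level session;
    the lock ensures at most one sess.run executes at a time.
    """
    with _session_lock:
        return session.run(output_names, input_feed)
```

Serializing the calls this way trades throughput for a bounded GPU footprint: memory use matches single-threaded operation regardless of how many HTTP requests arrive.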

System information

  • OS Platform and Distribution: Windows 10
  • ONNX Runtime version: 1.8
  • Python version: 3.7
  • GPU model and memory: RTX 3070, 8 GB


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow