How to use onnxruntime in parallel with Flask?
I created a server that needs to run an onnxruntime session in parallel.
First question: should I use multi-threading or multi-processing?
I tried multi-threading by starting Flask with `app.run(host='127.0.0.1', port=12345, threaded=True)`.
With 3 concurrent threads, GPU memory usage stays below 8 GB and the program runs. With 4 threads, GPU memory usage exceeds 8 GB and the program crashes with: `onnxruntime::CudaCall CUBLAS failure 3: CUBLAS_STATUS_ALLOC_FAILED`.
I know the problem is GPU memory exhaustion, but I want the program not to crash. So I tried to limit the number of threads by setting `intra_op_num_threads = 2`, or `inter_op_num_threads = 2`, or `os.environ["OMP_NUM_THREADS"] = "2"`, but none of these worked.
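For reference, this is roughly how I applied those settings (a minimal sketch; `model_XXX` is the model-path placeholder from the code below, and both thread options are regular `onnxruntime.SessionOptions` fields):

```python
import os
import onnxruntime as rt

os.environ["OMP_NUM_THREADS"] = "2"  # set before the session is created

# Limit ONNX Runtime's internal thread pools.
sess_options = rt.SessionOptions()
sess_options.intra_op_num_threads = 2  # threads used within a single operator
sess_options.inter_op_num_threads = 2  # threads used to run operators in parallel

sess = rt.InferenceSession(model_XXX, sess_options,
                           providers=['CUDAExecutionProvider'])
```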
I also tried `gpu_mem_limit`, but that didn't work either.
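This is how I passed `gpu_mem_limit` (a sketch; onnxruntime accepts CUDA provider options as a `(name, options_dict)` tuple, and the 4 GB cap here is just an example value):

```python
import onnxruntime as rt

# Cap the CUDA memory arena; the session still fails once concurrent
# requests push total allocations past the physical 8 GB.
cuda_options = {
    'device_id': 0,
    'gpu_mem_limit': 4 * 1024 * 1024 * 1024,  # bytes
    'arena_extend_strategy': 'kSameAsRequested',
}
sess = rt.InferenceSession(
    model_XXX,
    providers=[('CUDAExecutionProvider', cuda_options)])
```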
```python
import onnxruntime as rt
from flask import Flask, request

app = Flask(__name__)

# Created once at module level, so every request handler uses this session.
sess = rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider'])

@app.route('/algorithm', methods=['POST'])
def parser():
    prediction = sess.run(...)
    return str(prediction)

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=12345, threaded=True)
```
My understanding is that the Flask HTTP server may be using a different `sess` for each call. How can I make every call use the same onnxruntime session?
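One workaround I am considering is to keep the single module-level session but cap how many requests may call `sess.run` at once with a semaphore, since three concurrent runs fit in 8 GB. This is only a sketch of the idea, not a confirmed fix:

```python
import threading

import onnxruntime as rt
from flask import Flask, request

app = Flask(__name__)

# One session shared by all of Flask's worker threads.
sess = rt.InferenceSession(model_XXX, providers=['CUDAExecutionProvider'])

# Allow at most 3 requests to run inference concurrently.
gpu_slots = threading.Semaphore(3)

@app.route('/algorithm', methods=['POST'])
def parser():
    with gpu_slots:  # a 4th request blocks here until a slot frees up
        prediction = sess.run(...)
    return str(prediction)

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=12345, threaded=True)
```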
System information
- OS Platform and Distribution: Windows 10
- ONNX Runtime version: 1.8
- Python version: 3.7
- GPU model and memory: RTX 3070, 8 GB
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow