gnu parallel vs multiprocessing
What is the difference between GNU parallel and the Python multiprocessing module? And which one is better suited to which circumstances, if they differ in usage?
I am trying to parallelize tesseract and found someone recommending GNU parallel here on the tesseract issues.
I want to understand, which one would be ideal for which use case, before going with one over the other.
Also, I'm not getting the desired results from GNU parallel: I can see 4 processes running in top, but they take a lot more time than Python multiprocessing.
1) For GNU parallel:
Time taken is 8 min 40s as can be seen here
I am using the following command:
ls image*.jpg | time parallel tesseract {} stdout -l hin
top output is here
2) Normal Tesseract using multipage feature.
$ time tesseract imagelist.txt stdout -l hin
Speed can be seen here
3) Multiprocessing-based pytesseract
gives a much greater speed increase, finishing in about 4-5 seconds.
My pdf can be found here
I am using convert_from_path from pdf2image or convert from ImageMagick to convert the PDF to images, as either PNG or JPEG.
Solution 1:[1]
With Python multiprocessing, the processes can communicate with each other, but that comes with the cost of synchronization. GNU parallel simply runs a command multiple times with different parameters, so you then need to aggregate the results in a separate process.
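To make the communication point concrete, here is a minimal sketch of how multiprocessing hands results back to the parent process. `fake_ocr` and `run_pool` are hypothetical names, and `fake_ocr` is only a stand-in for a real OCR call, not anything from pytesseract.

```python
from multiprocessing import Pool


def fake_ocr(image_name):
    # Hypothetical stand-in for a real OCR call; returns a (name, text) pair.
    return image_name, f"text of {image_name}"


def run_pool(image_names, workers=4):
    # Pool.map sends each argument to a worker process and sends the
    # return value back to the parent, keeping the input order.
    with Pool(processes=workers) as pool:
        return pool.map(fake_ocr, image_names)


if __name__ == "__main__":
    # The results are already aggregated in the parent as a list.
    for name, text in run_pool(["image1.jpg", "image2.jpg"]):
        print(name, text)
```

With GNU parallel there is no such return channel: each job just writes to its stdout or to files, and collecting that output is a separate step.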
Solution 2:[2]
The problem is analyzed here: https://github.com/tesseract-ocr/tesseract/issues/3109
Solution:
export OMP_THREAD_LIMIT=1
before running tesseract in parallel.
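If the tesseract jobs are started from Python rather than from the shell, the same limit can be set in the environment of each child process. The following is only a sketch, assuming tesseract is on the PATH and the images are named image*.jpg as in the question. The helper names `build_env`, `ocr_one` and `ocr_all` are made up for illustration.

```python
import glob
import os
import subprocess
from multiprocessing import Pool


def build_env(thread_limit=1):
    # Copy the current environment and cap OpenMP threads per process.
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def ocr_one(image_path):
    # Run the same command as in the question on a single image
    # and return the recognized text from stdout.
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", "hin"],
        env=build_env(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_all(image_paths, workers=4):
    # One tesseract process per worker, each limited to one OpenMP thread.
    with Pool(processes=workers) as pool:
        return pool.map(ocr_one, image_paths)


if __name__ == "__main__":
    texts = ocr_all(sorted(glob.glob("image*.jpg")))
    print("\n".join(texts))
```

The idea is the same as the export above: with several tesseract processes running at once, letting each one also spawn its own OpenMP threads can oversubscribe the cores, so each process is held to a single thread.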
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Grzegorz Bokota |
| Solution 2 | Ole Tange |
