OCR with Tesseract for a grayscale, tab-delimited table of numbers
What I thought would be a fairly simple OCR task is turning out to be more challenging.
Here is the original image, which is a grayscale table of numbers that is tab-delimited (most likely):
And here is my attempt:
from pathlib import Path
from PIL import Image
import pytesseract
import cv2

# Load the table image and convert it to single-channel grayscale
image_path = Path("table.png")
img = cv2.imread(str(image_path))
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cv2.imwrite(str(Path("removed_noise1.png")), img)

# Binarize: pixels brighter than 220 become white, everything else black
_, img = cv2.threshold(img, 220, 255, cv2.THRESH_BINARY)
cv2.imwrite(str(Path("thresh1.png")), img)

# Resize to a fixed 2500x1200 (width, height) before OCR
img = cv2.resize(img, (2500, 1200))
cv2.imwrite(str(Path("resize1.png")), img)

# --psm 6: treat the image as a single uniform block of text; digits only
custom_config = r"--psm 6 -c tessedit_char_whitelist=0123456789"
text = pytesseract.image_to_string(Image.open(str(Path("resize1.png"))), config=custom_config)
print(text)
>>>
18 25 8 19 6 7 5 11 2 1 0 0 0 1 2 2 1 3 4 3
58 37 45 5942 4441 50 25 2 3 4 1 1 2 2 1 3 4 3
20 15 32 18 33 32 34 26 31 6 14 13 7 2 2 2 3 3 4 3
4 3 11 3 13 12 13 9 20 9 17 17 12 5 3 4 6 3 4 4
0 0 3 1 4 4 4 3 11 9 15 15 13 8 5 5 7 3 4 4
0 1 0 1 1 2 1 5 9 12 12 13 9 6 6 8 4 4 4
0 0 0 0 1 0 1 0 3 8 9 9 11 9 7 6 8 4 4 5
0 0 0 0 0 0 0 0 2 7 7 7 9 9 7 7 8 5 4 5
0 0 0 0 0 0 0 1 6 5 5 7 8 7 7 8 5 4 5
0 0 0 0 0 0 0 0 0 6 4 4 6 7 7 6 7 5 4 5
0 0 0 0 0 0 1 0 0 5 3 3 5 7 6 6 6 5 4 5
0 0 0 0 0 0 0 0 0 5 2 2 4 6 6 6 6 5 5 5
10 0 0 0 0 0 0 0 4 2 2 3 5 5 3 5 5 5 5
0 0 0 0 0 0 0 0 0 4 1 2 2 4 5 5 4 5 5 5
0 0 0 0 0 0 0 0 0 3 1 2 2 4 5 4 5 5 5
0 0 0 0 0 10 0 0 3 1 1 1 3 4 4 3 5 4 4
0 0 0 0 0 0 0 0 0 3 1 1 3 4 4 3 5 4 4
0 0 0 0 0 0 0 0 0 2 1 1 1 2 3 4 3 4 4 4
0 0 0 0 0 0 10 0 0 2 0 0 1 2 3 3 2 4 4 4
0 0 0 0 0 0 0 0 0 2 0 0 1 2 3 3 2 4 4 4
0 0 0 0 0 0 0 0 0 2 0 0 0 1 3 3 2 4 4 4
0 0 0 0 0 0 10 0 0 1 0 0 0 1 2 2 1 4 4 4
0 0 0 0 0 0 0 0 0 1 0 0 0 1 2 2 1 4 4 3
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 3 4 3
Many zeros are being read as 10, and some digits are misrecognized: for example, the number in the second row, second column (57) comes out as 37.
I've tried several different resize dimensions to no avail, but I obviously haven't exhausted all the possibilities.
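To see which rows are affected, a quick sanity check on the OCR text can help. The sketch below assumes every row of the table should have 20 values, which is the count in most rows of the output above; rows where two neighbouring values appear to have merged (such as `0 0` read as `10`) then come out one or more tokens short. The function name `find_suspect_rows` and the `expected_cols` parameter are made up for illustration.

```python
def find_suspect_rows(ocr_text, expected_cols=20):
    """Return (row_number, token_count) for rows with an unexpected token count.

    expected_cols=20 is an assumption based on the OCR output, not a known
    property of the original table.
    """
    # Split each non-empty line on whitespace; blank lines are ignored
    rows = [line.split() for line in ocr_text.splitlines() if line.strip()]
    suspects = []
    for number, tokens in enumerate(rows, start=1):
        if len(tokens) != expected_cols:
            suspects.append((number, len(tokens)))
    return suspects


# First three rows of the OCR output shown above
sample = """18 25 8 19 6 7 5 11 2 1 0 0 0 1 2 2 1 3 4 3
58 37 45 5942 4441 50 25 2 3 4 1 1 2 2 1 3 4 3
20 15 32 18 33 32 34 26 31 6 14 13 7 2 2 2 3 3 4 3"""

print(find_suspect_rows(sample))  # [(2, 18)]
```

This only catches merged or dropped values. A substitution such as 57 read as 37 keeps the row length intact and has to be found by comparing against the image.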
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow