'Preprocessing image for Tesseract OCR with OpenCV

I'm trying to develop an App that uses Tesseract to recognize text from documents taken by a phone's cam. I'm using OpenCV to preprocess the image for better recognition, applying a Gaussian blur and a Threshold method for binarization, but the result is pretty bad.

Here is the the image I'm using for tests:

And here the preprocessed image:

What others filter can I use to make the image more readable for Tesseract?

Solution 1:^[1]

Scanning at 300 dpi (dots per inch) is not officially a standard for OCR (optical character recognition), but it is considered the gold standard.

Converting image to Greyscale improves accuracy in reading text in general.

I have written a module that reads text in Image which in turn process the image for optimum result from OCR, Image Text Reader .

import tempfile

import cv2
import numpy as np
from PIL import Image

IMAGE_SIZE = 1800
BINARY_THREHOLD = 180

def process_image_for_ocr(file_path):
    # TODO : Implement using opencv
    temp_filename = set_image_dpi(file_path)
    im_new = remove_noise_and_smooth(temp_filename)
    return im_new

def set_image_dpi(file_path):
    im = Image.open(file_path)
    length_x, width_y = im.size
    factor = max(1, int(IMAGE_SIZE / length_x))
    size = factor * length_x, factor * width_y
    # size = (1800, 1800)
    im_resized = im.resize(size, Image.ANTIALIAS)
    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.jpg')
    temp_filename = temp_file.name
    im_resized.save(temp_filename, dpi=(300, 300))
    return temp_filename

def image_smoothening(img):
    ret1, th1 = cv2.threshold(img, BINARY_THREHOLD, 255, cv2.THRESH_BINARY)
    ret2, th2 = cv2.threshold(th1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    blur = cv2.GaussianBlur(th2, (1, 1), 0)
    ret3, th3 = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return th3

def remove_noise_and_smooth(file_name):
    img = cv2.imread(file_name, 0)
    filtered = cv2.adaptiveThreshold(img.astype(np.uint8), 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 41,
                                     3)
    kernel = np.ones((1, 1), np.uint8)
    opening = cv2.morphologyEx(filtered, cv2.MORPH_OPEN, kernel)
    closing = cv2.morphologyEx(opening, cv2.MORPH_CLOSE, kernel)
    img = image_smoothening(img)
    or_image = cv2.bitwise_or(img, closing)
    return or_image

Solution 2:^[2]

Note: this should be a comment to Alex I answer, but it's too long so i put it as answer.

from "An Overview of the Tesseract OCR engine, by Ray Smith, Google Inc." at https://github.com/tesseract-ocr/docs/blob/master/tesseracticdar2007.pdf

"Processing follows a traditional step-by-step pipeline, but some of the stages were unusual in their day, and possibly remain so even now. The first step is a connected component analysis in which outlines of the components are stored. This was a computationally expensive design decision at the time, but had a significant advantage: by inspection of the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text. Tesseract was probably the first OCR engine able to handle white-on-black text so trivially."

So it seems it's not needed to have black text on white background, and should work the opposite too.

Solution 3:^[3]

You can play around with the configuration of the OCR by changing the --psm and --oem values, in your case specifically I will suggest using

--psm 3 --oem 2

you can also look at the following link for further details here

Solution 4:^[4]

I guess you have used the generic approach for Binarization, that is the reason whole image is not binarized uniformly. You can use Adaptive Thresholding technique for binarization. You can also do some skew correction, perspective correction, noise removal for better results.

Refer to this medium article, to know about the above-mentioned techniques along with code samples.

Solution 5:^[5]

For wavy text like yours there is this fantastic Python code on GitHub, which transforms the text to straight lines: https://github.com/tachylatus/page_dewarp.git (this is the most updated version of MZucker's original post and the mechanics are explained here:https://mzucker.github.io/2016/08/15/page-dewarping.html)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Community
Solution 2	AndrewBloom
Solution 3	sameer maurya
Solution 4	Yunus Temurlenk
Solution 5	steca

'Preprocessing image for Tesseract OCR with OpenCV

Solution 1:[1]

Solution 2:[2]

Solution 3:[3]

Solution 4:[4]

Solution 5:[5]