What are the best practices for Tesseract OCR on low-quality images?

I've been working for a while on an OCR solution for my business, and I can't seem to get the hang of image filtering for low-quality images. The balance between removing noise and not breaking the characters is genuinely hard to strike.

What's the issue?

More specifically, this is the kind of text image that I work with: [image: character code to recognize]

And this is the result after cleaning it as much as I can: [image: character code after cleaning]

I'm using Python. When I pass this to Tesseract using --oem 3 --psm 8, the result is SC454B1TAC, which is not that bad, but I think the image should be good enough for the characters to be read correctly.
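
For reference, `--oem 3` selects Tesseract's default engine mode (whatever is available), and `--psm 8` tells it to treat the image as a single word. Below is a minimal sketch of assembling such a config string. `build_tesseract_config` and `ALNUM_UPPER` are hypothetical names introduced for illustration, not part of pytesseract, and the character set is an assumption based on the sample output.

```python
import string


def build_tesseract_config(oem=3, psm=8, whitelist=None):
    """Assemble a Tesseract config string (hypothetical helper)."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # Restrict recognition to the given characters.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Assumed character set: digits and uppercase letters only.
ALNUM_UPPER = string.digits + string.ascii_uppercase

config = build_tesseract_config(oem=3, psm=8, whitelist=ALNUM_UPPER)
print(config)
```

The resulting string can be passed as the `config` argument of `pytesseract.image_to_string`.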

What am I doing?

For the moment, the filtering that I perform goes like this:

  1. Convert the image to grayscale
  2. Binarize it with an adaptive Gaussian threshold
  3. Remove the dark band at the bottom
  4. Dilate and erode the image to remove spots
  5. Get the connected components of the resulting image and discard the ones below a minimum area
  6. Give the image to Tesseract and print the result

Here is some code, I hope it's clear enough:

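    import os

    import cv2
    import numpy as np
    import pytesseract

    # The methods below belong to a class; self.oem, self.psm, self.lang,
    # self.final_save_path and self.img_num are set elsewhere.

    # Remove dark band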
    def remove_band(self, image):
        col1 = [row[0] for row in image] # First column
        col2 = [row[1] for row in image] # Second column
        col3 = [row[2] for row in image] # Third column
        for i, c in enumerate(zip(col1, col2, col3)):
            if c[0] == 0 and c[1] == 0 and c[2] == 0:
                image[i] = 255
        return image

    # Tesseract func
    def print_text(self, rotated):
        # Get OCR output using Pytesseract
        # NOTE: We are using Tesseract 5. If you use Tesseract 4, white/blacklisting doesn't work. Also the algorithm is worse.
        # Installation guide: https://ubuntuhandbook.org/index.php/2021/12/install-tesseract-ocr-5-ubuntu/
        custom_config = (
            f'--oem {self.oem} --psm {self.psm}'
            ' -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            ' -c tessedit_char_blacklist=.-}{abcdefghijklmnopqrstuvwxyz'
        )
        out = pytesseract.image_to_string(rotated, config=custom_config, lang=self.lang)

        return out

    # Perform all steps for OCR
    def perform_ocr(self, image):

        # Convert to grayscale
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

        # Adaptive Gaussian threshold (block size 17, constant C = 5)
        thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 17, 5)

        # Remove the dark band
        noband = self.remove_band(thresh)
        
        # Dilate, then erode (a morphological closing of the white regions)
        kernel_d = np.ones((2, 2), np.uint8)
        kernel_e = np.ones((2, 2), np.uint8)
        img_dilation = cv2.dilate(noband, kernel_d, iterations=2)
        img_erosion = cv2.erode(img_dilation, kernel_e, iterations=2)

        # Get the connected components
        num_comps, labeled_pixels, comp_stats, comp_centroids = \
            cv2.connectedComponentsWithStats(img_erosion, connectivity=4)
        min_comp_area = 10  # pixels
        # get the indices/labels of the remaining components based on the area stat
        # (skip the background component at index 0)
        remaining_comp_labels = [i for i in range(1, num_comps)
                                 if comp_stats[i, cv2.CC_STAT_AREA] >= min_comp_area]
        # keep the pixels that belong to a remaining component (255)
        # and zero out everything else
        noiseless = np.where(np.isin(labeled_pixels, remaining_comp_labels), 255, 0).astype('uint8')

        # Save image
        final_save_file = os.path.join(self.final_save_path, 'final_' + str(self.img_num) + ".jpg")
        cv2.imwrite(final_save_file, noiseless)

        # Get Tesseract result
        out = self.print_text(noiseless)

        return out
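
As a side note on the band-removal step, the same row test can be written without Python-level loops. This is only a sketch: `remove_band_vectorized` and `probe_cols` are hypothetical names, it assumes a 2-D `uint8` image where 0 is black, and unlike the method above it returns a copy instead of modifying the input.

```python
import numpy as np


def remove_band_vectorized(image, probe_cols=3):
    """Whiten every row whose first `probe_cols` pixels are all black (0)."""
    out = image.copy()  # leave the caller's array untouched
    # Boolean mask with one entry per row: True where the leading pixels are all 0.
    band_rows = (out[:, :probe_cols] == 0).all(axis=1)
    out[band_rows] = 255
    return out


# Tiny synthetic image: white background, one stray dark pixel, one dark bottom row.
demo = np.full((4, 6), 255, dtype=np.uint8)
demo[1, 2] = 0   # isolated dark pixel, row should be kept as is
demo[3, :] = 0   # dark band on the bottom row, should be whitened
cleaned = remove_band_vectorized(demo)
```

Because only the leftmost pixels are probed, a character that touches the left edge of the image would also cause its rows to be whitened, which is the same limitation as in the loop version.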

As you can see there are a lot of parameters that are manually tuned and could be changed. I've played with them and these give the best results so far.
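
One way to make this tuning less manual, assuming a handful of images with known codes is available, would be to sweep the parameters and score each combination. The sketch below shows only the sweep itself. `grid_search`, the candidate values, and `toy_score` are all hypothetical; in practice the score function would run the pipeline and compare the OCR output against the known codes.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Return (best_params, best_score) over every combination in param_grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative candidates only. block_size must be odd for cv2.adaptiveThreshold.
grid = {
    "block_size": [11, 17, 25],
    "c": [3, 5, 7],
    "min_comp_area": [5, 10, 20],
}


def toy_score(params):
    # Placeholder that stands in for "fraction of labeled images read correctly".
    # It does not reflect any real measurement.
    return -(abs(params["block_size"] - 17)
             + abs(params["c"] - 5)
             + abs(params["min_comp_area"] - 10))


best, score = grid_search(grid, toy_score)
```

The grid grows multiplicatively with each parameter, so it is worth keeping the candidate lists short.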

How can you help?

Can you give me some advice on how to improve this method? Any libraries for cleaning, any useful functions I'm not using, a better set of parameters for these functions, advice on image resolution, lighting...

Anything helps!

Also, I think this is a known issue, but when the image is poor, Tesseract often recognizes a character twice and prints both results. What is a good way to handle this?

Thanks for everything, Fran.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
