How can I train Tesseract to recognize a dotted zero?
Question
How can I train Tesseract OCR to recognize a 0 as a zero, or hint to it that the zeroes are dotted? It is often recognized as a 6/8/9 with a 0% confidence that it is in fact a zero.
Here is a sample image. It is currently parsed as follows, with incorrectly parsed values highlighted in bold:
Input
Output
| X | Y | Z |
|---|---|---|
| 0.3 | 0.0 | 0.0 |
| 1.8 | 0.0 | 0.0 |
| 3.8 | 0.3 | **06.06** |
| 1.1 | 1.2 | 0.0 |
| **06.9** | 0.8 | 0.0 |
| 3.0 | 3.1 | **06.0** |
| 1.7 | 0.6 | 0.0 |
Source Code
I am using IronOCR with Tesseract to parse. Here is my configuration for the parser:
Input.AddPdf("myfile.pdf");
Input.Deskew(); // fixes rotation and perspective
Input.DeNoise(); // fixes digital noise and poor scanning
Ocr.Configuration.BlackListCharacters = "X@©®¢*%,";
Ocr.Language = OcrLanguage.EnglishBest;
Solution 1:[1]
Judging from your sample input image, you only need to recognize digits and the decimal point, so using the default eng.traineddata is definitely decreasing your accuracy. The default model includes character classes that you will never need to recognize, and it was trained on different fonts. You should train your own model on the specific font used in your input image and use it to recognize numbers only. You should also upgrade your Tesseract version: Tesseract 4.0 and above use LSTM inference, which gives relatively higher accuracy.
You can follow the official docs to learn how to train a custom Tesseract model.
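As a rough starting point for the training data, here is a minimal Python sketch that generates digits-only text lines shaped like the table in the question, with zeros deliberately over-represented. The function name `make_training_lines`, the `zero_bias` parameter, and the three-cells-per-line layout are all assumptions made for illustration; they are not part of Tesseract or IronOCR.

```python
import random

# The only characters the custom model needs to learn (plus the space between cells).
ALPHABET = "0123456789."


def make_training_lines(count, seed=0, zero_bias=0.5):
    """Return `count` lines such as "0.3 0.0 1.8", with "0" over-sampled.

    Hypothetical helper for illustration only; not a Tesseract API.
    """
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible

    def digit():
        # Emit "0" with probability `zero_bias`, otherwise a random non-zero digit.
        return "0" if rng.random() < zero_bias else rng.choice("123456789")

    lines = []
    for _ in range(count):
        # Three "d.d" cells per line, mirroring the X / Y / Z columns.
        cells = [f"{digit()}.{digit()}" for _ in range(3)]
        lines.append(" ".join(cells))
    return lines


lines = make_training_lines(1000)
```

Each generated line would still have to be rendered as an image in the same dotted-zero font as the scanned document and paired with its text as ground truth before it is useful for training. How that pairing is laid out on disk depends on the training tooling you follow in the official docs.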
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ege Yıldırım |

