Tesseract OCR Read Horizontally rather than Vertically C#

We have a C# .NET app that uses Tesseract to do Optical Character Recognition (OCR) on .tiff files. Here's an example: (image: example .tiff file that Tesseract reads)

We then output the data to a text file. However, Tesseract reads the data vertically. In my example image, it treats the tiff as two columns of data, and the output from Tesseract looks like this:

TYPE:
DATE:
Address:
City:
State:
Owner:
Owner Type:
Acreage:
Mortgage:
12345
2017-04-06
100 Main St.
Some City
Some State
John Doe
Primary
10.25
Yes

What we want is for Tesseract to read the tiff file horizontally, so that the output looks like this:

TYPE:12345
DATE:2017-04-06
Address:100 Main St.
City:Some City
State:Some State
Owner:John Doe
Owner Type:Primary
Acreage:10.25
Mortgage:Yes

We've tried the various Page Segmentation options for Tesseract, but they all produce the same result.

Has anyone run into this same issue? Anybody have any ideas?



Solution 1:[1]

I found a solution. Tesseract has a set of config files, and several of them contain the setting tessedit_pageseg_mode. In all of those files it was set to 1, which means automatic page segmentation with OSD (orientation and script detection).

Bottom line: these config file settings were overriding our command-line argument. Once I removed the tessedit_pageseg_mode parameter from the config files, our command-line argument of -psm 6 worked and produced the output data in the desired format.

psm=Page Segmentation Mode. 6=Assume a single uniform block of text

-psm 4 also worked

psm=Page Segmentation Mode. 4=Assume a single column of text of variable sizes
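Since the question involves a C# app, here is a rough sketch of passing that argument when launching the Tesseract executable from code. How the original app invoked Tesseract is not stated, so the `TesseractCli` class, the `Run` helper and its parameters are hypothetical, and the sketch assumes `tesseract` is on the PATH. Newer Tesseract releases spell the flag `--psm`, which is why the flag is a parameter.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not from the original post.
static class TesseractCli
{
    // Runs: tesseract <inputPath> <outputBase> <psmFlag> <psm>
    // Assumes the tesseract executable is on the PATH.
    // Older releases accept "-psm"; newer ones expect "--psm".
    public static int Run(string inputPath, string outputBase, int psm, string psmFlag = "-psm")
    {
        var startInfo = new ProcessStartInfo("tesseract")
        {
            UseShellExecute = false,
            RedirectStandardError = true,
        };
        startInfo.ArgumentList.Add(inputPath);
        startInfo.ArgumentList.Add(outputBase);
        startInfo.ArgumentList.Add(psmFlag);
        startInfo.ArgumentList.Add(psm.ToString());

        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Could not start tesseract.");

        // Read stderr before waiting so a full pipe cannot block the child process.
        string errors = process.StandardError.ReadToEnd();
        process.WaitForExit();

        if (process.ExitCode != 0)
        {
            Console.Error.WriteLine(errors);
        }
        return process.ExitCode;
    }
}
```

A call such as `TesseractCli.Run("example.tiff", "output", 6)` would ask Tesseract to write its text to output.txt using page segmentation mode 6. As described above, the flag only takes effect if no config file sets tessedit_pageseg_mode to a different value.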

Solution 2:[2]

I know this is an old post but I ran into the same problem today.

Setting the segmentation mode with engine.SetVariable("tessedit_pageseg_mode", 6); did not work.

For some reason, I also didn't find the setting in the config files.

Solution:

engine.DefaultPageSegMode = PageSegMode.SingleBlock;
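To show where that line fits, here is a minimal sketch assuming the open-source Tesseract .NET wrapper (the `Tesseract` NuGet package), which is where `PageSegMode.SingleBlock` appears to come from. The `./tessdata` folder, the `eng` language and the `example.tiff` path are placeholders, not values from the original post.

```csharp
using System;
using Tesseract;

// Assumed location: a tessdata folder containing eng.traineddata.
using var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default);

// PSM 6: treat the page as a single uniform block of text,
// so each label and its value are read on the same line.
engine.DefaultPageSegMode = PageSegMode.SingleBlock;

// Placeholder input file.
using var image = Pix.LoadFromFile("example.tiff");

// Process uses DefaultPageSegMode when no mode is passed explicitly.
using var page = engine.Process(image);

Console.WriteLine(page.GetText());
```

The mode can also be supplied per call with `engine.Process(image, PageSegMode.SingleBlock)`, which may be preferable when one engine instance handles documents with different layouts.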

Solution 3:[3]

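In C#, using the IronOCR library (IronTesseract), the code would be:

using IronOcr;

// somePdfStream is the input document stream
using var input = new OcrInput(somePdfStream);
// SingleBlock corresponds to page segmentation mode 6
var config = new TesseractConfiguration() { PageSegmentationMode = TesseractPageSegmentationMode.SingleBlock };
var result = new IronTesseract(config).Read(input);
var text = result.Text;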

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 MikeTWebb
Solution 2 Hans. M.
Solution 3 Johann