How to integrate tesseract-ocr with Tika?
I need to integrate tesseract-ocr with Tika to convert scanned images stored as PDFs to text.
A TesseractOCRParser is already available in Tika,
but I cannot find a documented way to invoke it.
When I try to build Tika with the path to my tesseract-ocr installation configured, I get the following test failure:
Results:
Failed tests:
testNoConfig(org.apache.tika.parser.ocr.TesseractOCRConfigTest):
Invalid default tesseractPath value expected:<[]> but was:
<[/home/serendio/tesseract-ocr/]>
Tests run: 569, Failures: 1, Errors: 0, Skipped: 7
Can anyone help me out?
Or is there another way to resolve this problem?
Solution 1:[1]
I think this can help: https://wiki.apache.org/tika/TikaOCR. I followed this guide and was able to extract the content easily. I simply installed Tesseract and then Tika.
Using Tika 1.9, I was able to extract the content both by calling a local Tika server directly and from a custom application (you can use the tika-example project as a starting point), with no extra effort.
No modification was needed. Everything worked out of the box.
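To illustrate the "local Tika server" route mentioned above, here is a minimal sketch in Python using only the standard library. It assumes a tika-server instance is already running on its default address, `http://localhost:9998`, and that Tesseract is installed where Tika can find it. The helper names `build_tika_request` and `extract_text` are hypothetical, not part of Tika. Whether OCR is actually applied to a given file depends on the Tika version and its configuration, so treat this as a starting point rather than a definitive recipe.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port (9998).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, content_type="application/pdf", url=TIKA_URL):
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",        # ask for extracted text, not XHTML
            "Content-Type": content_type,  # tells Tika what kind of file this is
        },
    )


def extract_text(path, content_type="application/pdf", url=TIKA_URL, timeout=120):
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        data = f.read()
    request = build_tika_request(data, content_type, url)
    # OCR can be slow, hence the generous (arbitrary) timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, something like `print(extract_text("scan.png", content_type="image/png"))` would print whatever text Tika returns for that image; `scan.png` is a placeholder file name.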
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
