Detect bounding boxes for text or non-text using tesseract
Q1. Please check the following code:
library(tesseract)
eng <- tesseract("eng")
ara <- tesseract("ara")
whitelist <- "1234567890-.,;:أةؤب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ا @߀!$%&/()=?+"
text1 <- ocr("E:/OCR Test/Test Bill.jpeg",
             engine = tesseract(language = "ara",
                                options = list(tessedit_char_whitelist = whitelist)))
How can I search the text extracted into text1 for a phrase, e.g. VAT NUMBER, and get the X1, Y1, X2, Y2 coordinates of that phrase? And how can I then get the coordinates of the number that follows VAT NUMBER, e.g. 3101XXXXXX? I need to combine the label (VAT NUMBER) and the value (3101XXXXXX) into one or two strings and, once their positions are known, copy them into another storage buffer, e.g. a data frame. Can anyone help? Please refer to the attached image here: Invoice. I also want to do the same for all the words and fields marked with red borders in the attached image.
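For what it's worth, here is a rough sketch of how the lookup could work. It assumes the words come from the tesseract package's ocr_data(), which returns one row per word with word, confidence and bbox columns, the bbox being an "x1,y1,x2,y2" string. The helper names split_bbox and find_label are made up for illustration, the demo data frame is invented rather than real OCR output, and taking "the next word" as the value assumes the number directly follows the label in reading order, which may not hold for mixed Arabic/English layouts.

```r
# Hypothetical helper: split "x1,y1,x2,y2" bbox strings into numeric columns.
split_bbox <- function(words) {
  coords <- do.call(rbind, strsplit(as.character(words$bbox), ",", fixed = TRUE))
  storage.mode(coords) <- "double"
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  cbind(as.data.frame(words[, c("word", "confidence")]), coords)
}

# Hypothetical helper: find a multi-word label, return its combined box
# together with the word that immediately follows it.
find_label <- function(words, label = c("VAT", "NUMBER")) {
  n <- length(label)
  starts <- seq_len(max(nrow(words) - n + 1, 0))
  hit <- Filter(function(i) all(toupper(words$word[i:(i + n - 1)]) == label), starts)
  if (length(hit) == 0) return(NULL)
  idx <- hit[1]:(hit[1] + n - 1)
  nxt <- hit[1] + n  # assumption: the value is the next word in reading order
  data.frame(
    label = paste(words$word[idx], collapse = " "),
    value = if (nxt <= nrow(words)) words$word[nxt] else NA_character_,
    x1 = min(words$x1[idx]), y1 = min(words$y1[idx]),
    x2 = max(words$x2[idx]), y2 = max(words$y2[idx])
  )
}

# Invented stand-in for real output, which would come from something like:
#   words <- tesseract::ocr_data("E:/OCR Test/Test Bill.jpeg", engine = ara)
words <- data.frame(
  word = c("VAT", "NUMBER", "3101XXXXXX"),
  confidence = c(95, 94, 90),
  bbox = c("10,20,50,40", "55,20,120,40", "130,20,220,40")
)

boxes <- split_bbox(words)
result <- find_label(boxes)
print(result)
```

Real OCR output often attaches punctuation to words (e.g. "NUMBER:"), so the comparison would probably need some cleaning first, and the same find_label call could be repeated for each field of interest and the results combined with rbind into one data frame.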
Q2. I have another question: if the image contains two languages, English and Arabic, how do I set both languages in the engine used for text1 above?
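One thing that might be worth trying, as far as I understand it: Tesseract accepts several languages joined with "+", and the R package appears to pass that string through, so an engine created with "eng+ara" should load both models. This is a sketch under that assumption, and it also assumes the Arabic training data is installed (tesseract_download("ara") can fetch it). The calls are guarded so nothing fails if the package or the image is missing.

```r
# Build the combined language string, e.g. "eng+ara".
langs <- paste(c("eng", "ara"), collapse = "+")
img <- "E:/OCR Test/Test Bill.jpeg"

if (requireNamespace("tesseract", quietly = TRUE) && file.exists(img)) {
  # Assumption: both "eng" and "ara" training data are already installed.
  both <- tesseract::tesseract(language = langs)
  text1 <- tesseract::ocr(img, engine = both)       # plain text
  words <- tesseract::ocr_data(img, engine = both)  # words with bounding boxes
}
```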
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
