Detect bounding box text or non-text using tesseract

Q1. Please check the following code:

library(tesseract)

# One engine per language (the training data for each must be installed).
eng <- tesseract("eng")
ara <- tesseract("ara")
whitelist <- "1234567890-.,;:أةؤب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ا @ß€!$%&/()=?+"
text1 <- ocr("E:/OCR Test/Test Bill.jpeg",
             engine = tesseract(language = "ara",
                                options = list(tessedit_char_whitelist = whitelist)))

How can I search the text extracted into text1 for a phrase, e.g. VAT NUMBER, and get the X1, Y1, X2, Y2 coordinates of that phrase? And how can I then get the coordinates of the number that follows VAT NUMBER, e.g. 3101XXXXXX? I need to combine the label (VAT NUMBER) and the value (3101XXXXXX) into one or two strings and, once their positions are known, copy them into another store such as a data frame. Can anyone help? Please refer to the attached image (Invoice). I also want to do the same for all the words and fields marked with red borders in the attached image.
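One possible direction for Q1, as a minimal sketch rather than a definitive answer: the tesseract package's ocr_data() returns one row per recognized word, with a bbox column holding "x1,y1,x2,y2" as a single string, whereas ocr() returns only plain text. The helpers split_bbox(), bbox_union() and find_phrase() below are hypothetical names introduced for illustration, and the sample rows are made up so the sketch runs without the invoice image. It assumes the words come back in reading order with the value directly after the label, which may not hold for mixed Arabic and English layouts.

```r
# Turn the "x1,y1,x2,y2" strings into numeric columns.
# `words` is assumed to have the shape returned by tesseract::ocr_data():
# columns `word`, `confidence`, `bbox`.
split_bbox <- function(words) {
  parts  <- strsplit(as.character(words$bbox), ",", fixed = TRUE)
  coords <- do.call(rbind, lapply(parts, as.integer))
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  data.frame(word = as.character(words$word),
             confidence = words$confidence,
             coords,
             stringsAsFactors = FALSE)
}

# Smallest box enclosing the words at positions `idx` (NA if none).
bbox_union <- function(boxes, idx) {
  if (length(idx) == 0) {
    return(c(x1 = NA_integer_, y1 = NA_integer_,
             x2 = NA_integer_, y2 = NA_integer_))
  }
  c(x1 = min(boxes$x1[idx]), y1 = min(boxes$y1[idx]),
    x2 = max(boxes$x2[idx]), y2 = max(boxes$y2[idx]))
}

# Find the first occurrence of `phrase` (matched word by word, ignoring case
# and trailing punctuation) and return its box plus the next `n_after` words.
find_phrase <- function(words, phrase, n_after = 1) {
  norm   <- function(x) toupper(sub("[[:punct:]]+$", "", x))
  boxes  <- split_bbox(words)
  target <- norm(strsplit(phrase, " +")[[1]])
  tokens <- norm(boxes$word)
  k <- length(target)
  n <- length(tokens)
  if (n < k) return(NULL)
  for (i in seq_len(n - k + 1)) {
    idx <- i:(i + k - 1)
    if (all(tokens[idx] == target)) {
      after <- (i + k - 1) + seq_len(n_after)
      after <- after[after <= n]
      lab <- bbox_union(boxes, idx)
      val <- bbox_union(boxes, after)
      return(data.frame(
        label    = paste(boxes$word[idx], collapse = " "),
        label_x1 = lab[["x1"]], label_y1 = lab[["y1"]],
        label_x2 = lab[["x2"]], label_y2 = lab[["y2"]],
        value    = paste(boxes$word[after], collapse = " "),
        value_x1 = val[["x1"]], value_y1 = val[["y1"]],
        value_x2 = val[["x2"]], value_y2 = val[["y2"]],
        stringsAsFactors = FALSE
      ))
    }
  }
  NULL  # phrase not found
}

# Made-up rows standing in for real OCR output. With the real image this
# would be something like:
# words <- tesseract::ocr_data("E:/OCR Test/Test Bill.jpeg",
#                              engine = tesseract::tesseract("eng"))
words <- data.frame(
  word       = c("Invoice", "VAT", "NUMBER:", "3101XXXXXX", "Total"),
  confidence = c(95, 93, 91, 88, 96),
  bbox       = c("10,10,90,30", "10,50,50,70", "55,50,130,70",
                 "140,50,260,70", "10,90,60,110"),
  stringsAsFactors = FALSE
)

hit <- find_phrase(words, "VAT NUMBER")
print(hit)
```

The result is already a one-row data frame, so calling find_phrase() once per field and combining the rows with rbind() would give one table covering all the labels of interest.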

Q2. Another question: if the image contains two languages, English and Arabic, how do I set both languages for text1 above?
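For Q2, a hedged sketch: Tesseract itself accepts several languages joined with "+", and this assumes the R wrapper passes a string such as "eng+ara" through to the engine unchanged. The guards are there so the snippet does nothing when the package, the training data, or the image is missing. Note also that the whitelist in the code above contains no Latin letters, so a label such as VAT NUMBER could not be returned while that whitelist is active.

```r
langs     <- c("eng", "ara")
lang_spec <- paste(langs, collapse = "+")   # "eng+ara"

# Only build the engine if the package and both language files are present.
# Missing training data can be fetched with tesseract::tesseract_download("ara").
if (requireNamespace("tesseract", quietly = TRUE) &&
    all(langs %in% tesseract::tesseract_info()$available)) {
  both  <- tesseract::tesseract(language = lang_spec)
  image <- "E:/OCR Test/Test Bill.jpeg"
  if (file.exists(image)) {
    text1 <- tesseract::ocr(image, engine = both)
  }
}
```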


r ocr tesseract text-extraction

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
