hOCR combine from input subfolder to output subfolder

I was looking for a function that combines the hOCR files found in a single input subfolder and writes the combined file, with a given file name, to a corresponding output subfolder.

#!/usr/bin/env python

from __future__ import print_function
import argparse

from lxml import etree, html

################################################################
# main program
################################################################

parser = argparse.ArgumentParser(
    description="combine multiple hOCR documents into one")
parser.add_argument(
    "filenames", help="hOCR files", nargs='+')
args = parser.parse_args()

# Parse the first file; its pages stay in place and the others are appended after them.
doc = html.parse(args.filenames[0])
# The XPath selects elements inside the document, so folder paths do not belong in it.
pages = doc.xpath("//*[@class='ocr_page']")
container = pages[-1].getparent()

for fname in args.filenames[1:]:
    doc2 = html.parse(fname)
    pages = doc2.xpath("//*[@class='ocr_page']")
    for page in pages:
        container.append(page)

print(etree.tostring(doc, pretty_print=True).decode('UTF-8'))

Source: https://github.com/ocropus/hocr-tools/blob/master/hocr-combine
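
The script above only merges the files named on the command line and prints the result, so the subfolder handling has to be added around it. Below is a minimal sketch of one way to do that, not part of hocr-tools: the helper names `combine_hocr` and `combine_tree`, the `*.hocr` pattern, and the `combined.hocr` output name are all assumptions made for illustration. It reuses the same lxml calls as the script and writes one combined file per input subfolder into an output subfolder of the same name.

```python
from pathlib import Path

from lxml import etree, html


def combine_hocr(paths):
    """Append the ocr_page elements of every later file to the first file."""
    doc = html.parse(str(paths[0]))
    pages = doc.xpath("//*[@class='ocr_page']")
    if not pages:
        raise ValueError("no ocr_page element found in %s" % paths[0])
    container = pages[-1].getparent()
    for path in paths[1:]:
        for page in html.parse(str(path)).xpath("//*[@class='ocr_page']"):
            container.append(page)
    return etree.tostring(doc, pretty_print=True).decode("UTF-8")


def combine_tree(input_root, output_root, pattern="*.hocr",
                 out_name="combined.hocr"):
    """Write one combined file per subfolder of input_root (hypothetical helper)."""
    input_root, output_root = Path(input_root), Path(output_root)
    written = []
    for sub in sorted(p for p in input_root.iterdir() if p.is_dir()):
        files = sorted(sub.glob(pattern))  # plain lexicographic order
        if not files:
            continue  # nothing to combine in this subfolder
        target = output_root / sub.name / out_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(combine_hocr(files), encoding="utf-8")
        written.append(target)
    return written
```

Assuming the folder layout hinted at in the question, a call such as `combine_tree("F:/Testing/input", "F:/Testing/output")` would write `F:/Testing/output/1/combined.hocr` from the files in `F:/Testing/input/1`. Files are combined in lexicographic order, so `page_10.hocr` sorts before `page_2.hocr`; zero-padded file names or a custom sort key avoid that.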



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow