Convert PDF to HTML via PyMuPDF

For pages with tabular data in landscape format, the words in the HTML output overlap. For pages in portrait format, the conversion is successful. Any ideas on how to fix this?

Here is an example of a landscape page converted from PDF to HTML:
[1]: https://i.stack.imgur.com/twbzw.png
[2]: https://i.stack.imgur.com/Ln56P.png

import fitz  # PyMuPDF

in_path = "input.pdf"  # placeholder path to the source PDF

doc = fitz.open(in_path)  # open document
with open(in_path + ".html", "wb") as out:  # open HTML output
    for page in doc:  # iterate the document pages
        text = page.get_text("html").encode("utf8")  # page content as HTML
        out.write(text)  # write text of page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
doc.close()
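
One way to narrow this down might be to check how each landscape page is actually defined: as a wide page, or as a portrait page shown rotated by 90 or 270 degrees. The sketch below only reports this per page and does not fix anything. That rotation is related to the overlap is an assumption, not something confirmed here. `describe_orientation` and `report_orientations` are hypothetical helper names; `page.rect`, `page.rotation` and `page.number` are PyMuPDF page attributes.

```python
def describe_orientation(width, height, rotation):
    """Classify a page from its displayed size and its rotation value."""
    shape = "landscape" if width > height else "portrait"
    # A rotation of 90 or 270 degrees swaps the displayed width and height.
    how = "via rotation" if rotation % 360 in (90, 270) else "native"
    return f"{shape} ({how})"


def report_orientations(pdf_path):
    """Print the orientation of every page in the PDF at pdf_path."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            rect = page.rect  # page rectangle as displayed (rotation applied)
            label = describe_orientation(rect.width, rect.height, page.rotation)
            print(page.number, label)
```

Calling `report_orientations("input.pdf")` (with a placeholder path) lists one line per page. If the pages with overlapping words all show up as "landscape (via rotation)", that would at least suggest looking at page rotation first.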

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
