Convert PDF to HTML via PyMuPDF
For pages with tabular data in landscape format, the words in the HTML output overlap. For pages in portrait format, the conversion is successful. Any ideas on how to fix this?
[Here is an example of the PDF converted to HTML in landscape format][1]

[1]: https://i.stack.imgur.com/twbzw.png
[2]: https://i.stack.imgur.com/Ln56P.png

import fitz  # PyMuPDF

in_path = "input.pdf"  # placeholder: the original does not show how in_path is set
doc = fitz.open(in_path)  # open document
out = open(in_path + ".html", "wb")  # open HTML output
for page in doc:  # iterate the document pages
    text = page.get_text("html", clip=None).encode("utf8")  # page content as HTML
    out.write(text)  # write HTML of page
    out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
out.close()
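
One way to narrow the problem down, offered as a diagnostic sketch rather than a confirmed fix, is to check whether the affected pages are landscape because of a rotation flag or because of their native dimensions. The helper name `classify_page` is hypothetical; `page.rect`, `page.rotation`, and `page.number` are PyMuPDF attributes, and the idea that rotation is related to the overlap is only an assumption.

```python
def classify_page(width: float, height: float, rotation: int = 0):
    """Return (orientation, rotation) for a page.

    width/height are the displayed page dimensions (e.g. page.rect.width and
    page.rect.height in PyMuPDF); rotation is the page rotation in degrees.
    """
    orientation = "landscape" if width > height else "portrait"
    return orientation, rotation % 360  # normalize rotation to 0..359


# Hypothetical use with the loop above (requires PyMuPDF, not run here):
# for page in doc:
#     print(page.number, classify_page(page.rect.width, page.rect.height, page.rotation))
```

If the overlapping pages all report a non-zero rotation while the portrait pages report 0, that would point at the rotation flag as something to investigate; if not, the cause lies elsewhere.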
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
