How to read PDF text and write it to a document while preserving formatting using Python?
The code below prints the PDF text perfectly in the console using print(page.extract_text()). I can copy the text from the console and paste it into Word, and the formatting is preserved. However, if I use docx to add the text to a Word document and save it with document.save("word.docx"), the formatting changes and I have to correct it manually. Does anyone know how to save the PDF text while preserving the formatting?
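import glob
import os

import pdfplumber  # provides the PDF text extraction used below
from docx import Document

# Raw string so the backslashes are not treated as escape sequences
document = Document(r"C:\path\Test.docx")
path = "C:\\path\\"
os.chdir(path)

for pdf_file in glob.glob("*.pdf"):
    print(pdf_file)
    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            # Extract once and reuse the result for printing and saving
            text = page.extract_text() or ""
            print(text)
            document.add_paragraph(text)

# Saved relative to the current directory, i.e. inside `path`
document.save("word.docx")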
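One possible cause, offered as an assumption rather than a confirmed diagnosis: python-docx turns newline characters passed to add_paragraph() into line breaks inside a single paragraph, whereas pasting console text into Word typically produces one paragraph per line. The sketch below mimics the paste behaviour by adding each extracted line as its own paragraph. The helper name add_text_line_by_line is hypothetical, and it only relies on the document object having an add_paragraph() method.

```python
def add_text_line_by_line(document, text):
    """Add each line of `text` to `document` as a separate paragraph.

    `document` is assumed to be a python-docx Document, but any object
    with an add_paragraph(str) method works. Returns the number of
    paragraphs added.
    """
    count = 0
    # Treat a missing text layer (None or "") as no lines at all.
    for line in (text or "").splitlines():
        # Blank lines become empty paragraphs so vertical gaps survive;
        # leading spaces in a line are passed through unchanged.
        document.add_paragraph(line)
        count += 1
    return count
```

In the loop above, this would replace the call document.add_paragraph(text) with add_text_line_by_line(document, text). Whether the result matches the pasted version also depends on the styles defined in Test.docx, so fonts and paragraph spacing may still need adjusting.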
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
