Create array/dictionary from PDF with topic - page number pairs
The goal is to split a PDF of unknown formatting into an array in the following way.
PDF content:
TOPIC
texttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttext
TOPIC2
texttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttext ...
Page n
Next page, same PDF
texttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttext
TOPIC3
texttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttext ...
Page n+1
Array/Dictionary:
Content,PageNumber(s)
TOPIC+content of TOPIC as plain text without formatting,(n)
TOPIC2+content of TOPIC2 as plain text without formatting,(n,n+1)
TOPIC3+content of TOPIC3 as plain text without formatting,(n+1)
or, if TOPIC3 spans more pages: TOPIC3+content of TOPIC3 as plain text without formatting,(n+1,n+2,...)
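To make the target shape concrete, here is a minimal Python sketch of the structure described above, with n taken as 1 purely for illustration. The names `sections` and `pages_by_content` and the placeholder strings are assumptions, not something fixed by the task. A list of (content, pages) tuples is used first because a dictionary keyed by content only works if no two sections have identical text.

```python
# Hypothetical target structure, with n = 1 for illustration only.
# Each entry pairs a section's plain text with the pages it appears on.
sections = [
    ("TOPIC texttext...", (1,)),
    ("TOPIC2 texttext... texttext...", (1, 2)),
    ("TOPIC3 texttext...", (2,)),
]

# A dictionary view is possible as long as the contents are unique.
pages_by_content = dict(sections)
```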
What I did:
I converted the PDF to HTML and split the result at each h1, h2 and h3 tag.
Disadvantages
- The PDF formatting gets lost.
- The page numbers get lost, because my PDF-to-HTML conversion program treats the document as a whole.
- More often than not, the conversion process uses other tags to mark a heading, so splitting at the h tags finds no split points and results in an array of length 1.
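One possible way around the lost page numbers, sketched here under stated assumptions rather than as a tested solution: keep the text separated per page and carry the page number along while splitting. The sketch assumes the text has already been extracted page by page into lists of lines (for example with a library that offers per-page text extraction, such as pypdf's `PdfReader(...).pages[i].extract_text()`), and that a heading can be recognised as a short all-uppercase line, as in the example above. Both `looks_like_heading` and `split_by_topic` are hypothetical names, and the heuristic would need adjusting for PDFs whose headings look different.

```python
from typing import Callable, List, Sequence, Tuple


def looks_like_heading(line: str) -> bool:
    """Assumed heuristic: a short line whose letters are all upper-case."""
    stripped = line.strip()
    return 0 < len(stripped) <= 60 and stripped.isupper()


def split_by_topic(
    pages: Sequence[Sequence[str]],
    is_heading: Callable[[str], bool] = looks_like_heading,
) -> List[Tuple[str, Tuple[int, ...]]]:
    """Group lines into sections and record the pages each section touches.

    `pages` holds one list of text lines per PDF page, in page order.
    Page numbers are counted from 1.
    """
    collected: List[Tuple[List[str], List[int]]] = []
    for page_number, lines in enumerate(pages, start=1):
        for line in lines:
            text = line.strip()
            if not text:
                continue  # skip blank lines
            # A heading opens a new section; text that appears before the
            # first heading opens an untitled section of its own.
            if is_heading(text) or not collected:
                collected.append(([text], [page_number]))
            else:
                section_lines, section_pages = collected[-1]
                section_lines.append(text)
                if section_pages[-1] != page_number:
                    section_pages.append(page_number)
    return [(" ".join(ls), tuple(ps)) for ls, ps in collected]


# Toy input mirroring the example layout: two pages, three topics.
example_pages = [
    ["TOPIC", "text a", "TOPIC2", "text b"],
    ["text c", "TOPIC3", "text d"],
]
result = split_by_topic(example_pages)
```

With the toy input, `result` contains "TOPIC2 text b text c" paired with pages (1, 2), because the section continues onto the second page. Running footers such as "Page n" are not filtered out in this sketch and would end up in the section text unless removed beforehand.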
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
