Create array/dictionary from PDF with topic - page number pairs

The goal is to split a PDF of unknown formatting into an array in the following way:

PDF-content:

TOPIC

texttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttext

TOPIC2

texttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttext ...

Page n

Next page, same PDF

texttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttext

TOPIC3

texttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttexttext ...

Page n+1

Array/Dictionary:

Content,PageNumber(s)
TOPIC+content of TOPIC as plain text without formatting,(n)
TOPIC2+content of TOPIC2 as plain text without formatting,(n,n+1)
TOPIC3+content of TOPIC3 as plain text without formatting,(n+1)
or, if TOPIC3 spans more pages: TOPIC3+content of TOPIC3 as plain text without formatting,(n+1,n+2,...)
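
For illustration, the target structure could be written in Python roughly as follows. The `Section` name, the shortened placeholder strings and the concrete value of `n` are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Section:
    content: str            # heading plus body as plain text, no formatting
    pages: Tuple[int, ...]  # every page the section appears on


n = 4  # hypothetical page number, chosen only to make the example concrete

sections: List[Section] = [
    Section("TOPIC texttext", (n,)),
    Section("TOPIC2 texttext", (n, n + 1)),  # starts on page n, continues on n+1
    Section("TOPIC3 texttext", (n + 1,)),
]

for section in sections:
    print(section.pages, section.content)
```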

What I did:

I converted the PDF to HTML and split the result at each h1, h2 and h3 tag.

Disadvantages

  1. The PDF formatting gets lost.
  2. Page numbers get lost, because my PDF-to-HTML conversion program processes the document as a whole.
  3. More often than not, the conversion comes up with its own tags to mark a heading, so splitting at the h tags results in an array of length 1.
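
One possible way around disadvantages 2 and 3 might be to skip the HTML step, work on plain text page by page, and detect headings with a heuristic. The following is a minimal sketch, not a tested solution for real PDFs. It assumes the text of each page is already available as a list of strings (for example from a library such as pypdf via `page.extract_text()`), that headings are short lines in capitals, and that footers look like "Page 4". The function name `split_by_topic` and both regular expressions are hypothetical and would need tuning for an actual document.

```python
import re
from typing import List, Tuple

# Assumption: a heading is a short line of capitals, digits and spaces.
HEADING = re.compile(r"^[A-Z][A-Z0-9 ]{0,60}$")
# Assumption: page footers look like "Page 4" and are not part of the content.
FOOTER = re.compile(r"^Page\s+\d+$")


def split_by_topic(
    pages: List[str], first_page: int = 1
) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (content, page_numbers) pairs, one per detected topic."""
    sections: List[Tuple[List[str], List[int]]] = []
    for page_no, text in enumerate(pages, start=first_page):
        for raw_line in text.splitlines():
            line = raw_line.strip()
            if not line or FOOTER.match(line):
                continue  # skip blank lines and page footers
            if HEADING.match(line) or not sections:
                # A new topic starts here (or text appears before any heading).
                sections.append(([line], [page_no]))
            else:
                lines, numbers = sections[-1]
                lines.append(line)
                if numbers[-1] != page_no:
                    numbers.append(page_no)  # topic continues on a new page
    return [(" ".join(lines), tuple(numbers)) for lines, numbers in sections]


# Two hypothetical pages mirroring the layout described in the question.
example_pages = [
    "TOPIC\nalpha\nTOPIC2\nbeta\nPage 4",
    "gamma\nTOPIC3\ndelta\nPage 5",
]
result = split_by_topic(example_pages, first_page=4)
for content, numbers in result:
    print(numbers, content)
```

Because each page is processed separately, the page number is known at the moment a line is assigned to a topic, so a topic that continues past a page break simply collects a second page number.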


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
