How to extract bullet points, numbered lists, and the general page layout from a PDF file using Python

I am currently working on a project that extracts information from the slides my university provides and rebuilds them in another layout. To do that, I need to extract the text from the slides while preserving the layout, such as bulleted and numbered lists, and distinguish those from headings and other elements.

I have already tried two libraries, pdfminer and PyPDF2. However, the text they return is nothing more than a plain string. For example, with this code:

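from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

pdfPath = "slides.pdf"  # placeholder: path to the PDF slide deck

# Print the text of every text container on every page.
for page_layout in extract_pages(pdfPath):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())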

The output looks something like this:

Introduction to

What is Pair Programming?

 Two programmers work side-by-side at one computer,

collaborating on the same code

 One programmer (the driver) does the coding, and the other

programmer (the observer or navigator) continuously

 The two programmers switch roles periodically

Driver

Navigator

This is Pair Programming

Work side-by-side on one computer

This is Pair Programming

Continuously collaborating on the code

Switch role between Driver

I have also tried regular expressions with the re library, but I am not sure that approach will work. For example, it is hard to tell where a bulleted list item ends, and the slide formats differ from one PDF to another.
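To make the regex idea concrete, here is a minimal sketch of how it might look. It rests on assumptions that are not confirmed by the slides: bullet glyphs are taken to be either common Unicode bullets or Private Use Area characters (symbol fonts often end up there), and an unmarked line is treated as a continuation of the previous list item only if it starts with a lowercase letter. The names `BULLET`, `NUMBERED`, and `group_items` are hypothetical helpers, and the lowercase rule is exactly the kind of heuristic that can fail on differently formatted slides.

```python
import re

# Assumption: bullets are common Unicode bullet glyphs, "*" or "-", or
# Private Use Area characters (U+E000..U+F8FF) produced by symbol fonts.
BULLET = re.compile(r"^\s*[\u2022\u25aa\u25cf\u25e6\u2023\ue000-\uf8ff*-]\s+(.*)$")
# Assumption: numbered items look like "1." or "2)".
NUMBERED = re.compile(r"^\s*\d+[.)]\s+(.*)$")


def group_items(lines):
    """Group raw text lines into (kind, text) tuples.

    kind is "bullet", "numbered", or "text". An unmarked line is merged into
    the preceding list item only if it starts with a lowercase letter, which
    is a heuristic and not a property of the PDF.
    """
    items = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no structure in this sketch
        bullet = BULLET.match(raw)
        numbered = NUMBERED.match(raw)
        if bullet:
            items.append(["bullet", bullet.group(1).strip()])
        elif numbered:
            items.append(["numbered", numbered.group(1).strip()])
        elif items and items[-1][0] != "text" and line[0].islower():
            items[-1][1] += " " + line  # continuation of the open list item
        else:
            items.append(["text", line])
    return [tuple(item) for item in items]


if __name__ == "__main__":
    sample = [
        "What is Pair Programming?",
        "",
        "\uf0a7 Two programmers work side-by-side at one computer,",
        "",
        "collaborating on the same code",
        "",
        "Driver",
    ]
    for kind, text in group_items(sample):
        print(kind, "|", text)
```

Because the decision is made from the text alone, this cannot reliably tell a wrapped bullet line from an unrelated label that happens to start in lowercase, which is the limitation described above.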

So I would like to know how to turn this string into structured information, such as where the headings are and where the bulleted lists are. Any suggestions on how to do that?

That would be really helpful.
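
One direction that might help, sketched here under stated assumptions: pdfminer's layout objects carry more than the string. Iterating over a text container yields `LTTextLine` objects, and iterating over a line yields `LTChar` objects that have a `size` attribute. The sketch below uses that to guess headings from font size. The helper names `iter_lines` and `classify`, the 1.15 size ratio, and the bullet character set are all hypothetical choices for illustration, not something the slides or the library prescribe.

```python
import re
from collections import Counter

# Assumption: bullets are common Unicode bullet glyphs or Private Use Area
# characters (U+E000..U+F8FF) coming from symbol fonts.
BULLET = re.compile(r"^\s*[\u2022\u25aa\u25cf\u25e6\u2023\ue000-\uf8ff]\s*")


def iter_lines(pdf_path):
    """Yield (text, font_size) for every non-empty text line in the PDF.

    Needs pdfminer.six; it is imported here so that classify() below can be
    used without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [round(ch.size, 1) for ch in line if isinstance(ch, LTChar)]
                text = line.get_text().strip()
                if sizes and text:
                    # Use the most common character size as the line's size.
                    yield text, Counter(sizes).most_common(1)[0][0]


def classify(lines):
    """Label (text, size) pairs as "heading", "bullet", or "body".

    Heuristic: the most frequent font size is taken to be body text, and
    anything more than 15% larger is called a heading.
    """
    lines = list(lines)
    if not lines:
        return []
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    labelled = []
    for text, size in lines:
        if BULLET.match(text):
            labelled.append(("bullet", BULLET.sub("", text, count=1)))
        elif size > body_size * 1.15:
            labelled.append(("heading", text))
        else:
            labelled.append(("body", text))
    return labelled
```

With a real file this would be called as `classify(iter_lines(pdfPath))`. Each line object also has an `x0` coordinate, so indentation could be added as another signal for nested lists, though how well any of these thresholds hold will depend on the slide template.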



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
