Extract heading and content from an HTML page using a visual approach in Python

I'm looking for a way to extract the heading and content from raw HTML. There are a couple of Python packages out there that do this (Newspaper3k, python-readability, python-goose), but I'm looking to do something closer to how the human eye sees a page. My idea is to use the visual placement of a div on a page to determine whether it's part of the main content or not. How can I extract the placement of a div using Python? Any other ideas on how to approach this problem?

python html

Solution 1:^[1]

To the best of my understanding, you want to locate and extract HTML from certain divs on a website, but working on screen with a cursor and a keyboard, as a human would. For that purpose, you could go with PyAutoGUI.

You can use pyautogui.locateOnScreen() with a reference image of your choice to find an element on screen, and then continue with your scraping tools. With PyAutoGUI, you can automate click events as well.
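As a rough illustration of that idea, here is a minimal sketch, not a definitive implementation. It assumes PyAutoGUI is installed, a graphical display is available, and you have saved a small screenshot of the element you want to find. The helper names box_center and locate_and_click are made up for this example. Note that this gives you screen coordinates only, not HTML.

```python
def box_center(box):
    """Return the (x, y) center of a (left, top, width, height) box."""
    left, top, width, height = box
    return (left + width // 2, top + height // 2)


def locate_and_click(image_path):
    """Find a reference image on screen and click its center.

    Returns the clicked (x, y) point, or None if the image is not found.
    """
    import pyautogui  # imported lazily because it needs a graphical display

    try:
        # locateOnScreen returns a (left, top, width, height) box
        box = pyautogui.locateOnScreen(image_path)
    except pyautogui.ImageNotFoundException:
        # some PyAutoGUI versions raise instead of returning None
        box = None
    if box is None:
        return None

    point = box_center(box)
    pyautogui.click(*point)
    return point
```

You would call it as locate_and_click("headline.png"), where headline.png is a hypothetical screenshot of the region you are after.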

For further research, you can check the docs.

I hope this answers your question. If you have any doubts, please feel free to ask!

Solution 2:^[2]

As you mentioned, the worst part of the Python packages you listed is the HTML and DOM structure knowledge they require. That knowledge is nevertheless necessary for scraping, so I can share a hybrid approach.

First step: I use the WebScraper.io Chrome extension to visually select items on the page (as in the image) and save them.

Second step: Once I have DOM selectors such as p a.cta (as in the image), I use them with a Python scraping package.

I use this approach for almost every scraping project. I hope it helps.
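To make the second step concrete, here is a minimal sketch of what applying a selector like p a.cta amounts to. It uses only the standard library's html.parser as a stand-in so that it runs without extra installs. With a scraping package such as BeautifulSoup, the same query would typically be soup.select("p a.cta"). The sample HTML and the names CtaLinkParser and extract_cta_links are invented for illustration, and the sketch assumes every <p> is explicitly closed.

```python
from html.parser import HTMLParser


class CtaLinkParser(HTMLParser):
    """Collect href and text of every <a class="cta"> nested inside a <p>."""

    def __init__(self):
        super().__init__()
        self.p_depth = 0      # number of currently open <p> elements
        self.current = None   # link being read, or None
        self.links = []       # collected {"href": ..., "text": ...} dicts

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self.p_depth += 1
        elif tag == "a" and self.p_depth > 0:
            # class may hold several space-separated names
            classes = (attrs.get("class") or "").split()
            if "cta" in classes:
                self.current = {"href": attrs.get("href"), "text": ""}

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.current["text"] = self.current["text"].strip()
            self.links.append(self.current)
            self.current = None
        elif tag == "p" and self.p_depth > 0:
            self.p_depth -= 1


def extract_cta_links(html):
    """Return the links that match the selector p a.cta in the given HTML."""
    parser = CtaLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample page: only the second link is an a.cta inside a <p>.
SAMPLE_HTML = """
<div><a class="cta" href="/ignored">Outside a paragraph</a></div>
<p>Intro text <a class="cta primary" href="/signup">Sign up</a></p>
<p><a href="/plain">Plain link</a></p>
"""

links = extract_cta_links(SAMPLE_HTML)
print(links)
```

The selector itself comes from the visual selection in the first step, so the Python side only has to apply it.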

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Strange
Solution 2	Suat Atan PhD
