Listing all the links of ANY given dynamic webpage with Python

I cannot find a way to create a generic crawler that can receive a webpage and list all the links inside it; the purpose is to inspect an entire domain and all of its internal links.

I've tried doing it with HtmlUnit (Java) and with Selenium (Python), but the search for internal links always has to target a specific tag or id, and I need this to work with any (or most) pages, since every page uses a different structure.

Thank you so much for your help



Solution 1:[1]

BeautifulSoup has an extensive toolkit for filtering HTML. For example, you can select every anchor element that has an href attribute set, regardless of the page's structure (example taken from the documentation):

# `soup` is a BeautifulSoup object parsed from the page's HTML
soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
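Each match is a Tag object, so the URL strings themselves can be pulled out of the href attribute, e.g.:

links = [a['href'] for a in soup.select('a[href]')]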

See more at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
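Since the goal is to inspect an entire domain, this selector can be combined with a fetch-and-queue loop. Below is a minimal sketch, assuming the pages are served as static HTML and using requests to download them; the function name crawl_domain and the start URL are illustrative, not from the question. For JavaScript-rendered pages you could instead grab the HTML from Selenium's driver.page_source and parse it the same way.

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl_domain(start_url):
    """Breadth-first crawl that prints every internal link of a domain."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        soup = BeautifulSoup(html, 'html.parser')
        for anchor in soup.select('a[href]'):
            # resolve relative links against the current page's URL
            link = urljoin(url, anchor['href']).split('#')[0]
            # only follow links that stay on the same domain
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
                print(link)


crawl_domain('https://example.com')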

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 davidverweij