Removing duplicate links from a scraper I'm making

#!/usr/bin/python3

import requests
from bs4 import BeautifulSoup
import re


url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)

soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))

Down at the bottom, where it prints the links... I know the deduplication has to go in there somewhere, but I can't think of a way to remove the duplicate entries. Can someone help me with that, please?



Solution 1:[1]

Use a set to remove duplicates. Calling add() inserts an item, and if the item is already present it simply isn't added again.
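
For example, adding the same value twice leaves only one copy in the set (a minimal sketch with made-up URLs):

seen = set()
seen.add("https://example.com")
seen.add("https://example.com")  # already present, so the set is unchanged
seen.add("https://example.org")
print(seen)                      # two unique URLs; the duplicate is gone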

Try this:

#!/usr/bin/python3

import requests
from bs4 import BeautifulSoup
import re

url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)

soup = BeautifulSoup(html, "html.parser")
urls = set()
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")}):
    urls.add(link.get('href'))
print(urls)  # urls now contains each URL exactly once

Note that some URLs might start with http://, so you may want to use the regexp ^https?:// to catch both http and https URLs.
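
For example (a small sketch with hypothetical URLs, not taken from the page being scraped):

pattern = re.compile(r"^https?://")
print(bool(pattern.match("https://example.com")))  # True
print(bool(pattern.match("http://example.com")))   # True  -- the ? makes the "s" optional
print(bool(pattern.match("ftp://example.com")))    # False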

You can also use set-comprehension syntax to rewrite the assignment and the for loop as a single expression, like this:

urls = {
    link.get("href")
    for link in soup.find_all("a", attrs={"href": re.compile(r"^https://")})
}
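
If you then want one URL per line instead of Python's set notation, you can iterate over the set when printing (sorted here only to make the output deterministic):

for href in sorted(urls):
    print(href)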

Solution 2:[2]

Instead of printing each link right away, you need to collect them somewhere so you can compare them.

Try this:

find_all() gives you a list with all the results; make it a set and the duplicates disappear.

data = set(
    link.get('href')
    for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")})
)

for elem in data:
    print(elem)
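
One thing to keep in mind: a set does not preserve the order in which the links appear on the page. If that order matters to you, an alternative (not part of the original answer) is to deduplicate with dict.fromkeys(), since dicts keep insertion order in Python 3.7+:

data = dict.fromkeys(
    link.get('href')
    for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")})
)

for elem in data:
    print(elem)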

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: (no author listed)
Solution 2: Rabinzel