Why does BeautifulSoup.find not work when using a for loop?

I am iterating through a list of about 2000 URLs to extract 5 items from each URL using BeautifulSoup.find. Because not every webpage is guaranteed to have all 5 items, I wrapped each lookup in a try/except block.

After completing the loop, I noticed 3 things:

  1. The very first 5-10 links ran seamlessly, meaning I successfully retrieved all 5 items (none of the except blocks were triggered).
  2. For the overwhelming majority of URLs, the try blocks did not succeed, so the except block ran for every item.
  3. Every once in a while, a URL's try blocks DID succeed and I retrieved all 5 items.

I placed the results in a list of dictionaries, and then created a dataframe.

import requests
from bs4 import BeautifulSoup
from random import randint
from time import sleep

# headers (request headers) and productlinks (the ~2000 product URLs) are defined earlier
cleanserlist = []
for link in productlinks:
    
    try:
        r = requests.get(link, headers=headers, timeout=3.05)
    except requests.exceptions.Timeout:
        print("Timeout occurred")
    
    soup = BeautifulSoup(r.content, 'lxml')
    
    try:
        price = soup.find('span', class_="sellingPrice").text.strip()
    except:
        price = 'no price'
    try:
        name = soup.find('h1', class_='flex flex-xs-100').text.strip()
    except:
        name = 'no name'
    try: 
        ingredients = soup.find('div', class_='v-pane-content').text.strip()
    except:
        ingredients = 'no ingredients'
    try:
        rating = soup.find('div', class_='ratingValue').text.strip()
    except:
        rating = 'no rating'
    try:
        reviews = soup.find('span', class_='reviewCount').text.strip()
    except:
        reviews = 'no reviews'
    
    cleanser = {
        'name': name,
        'price': price,
        'rating': rating,
        'reviews' : reviews,
        'ingredients': ingredients
    }
    cleanserlist.append(cleanser)
    sleep(randint(1,3))
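
The dataframe step mentioned above amounts to something like this (assuming pandas is imported as pd):

import pandas as pd

df = pd.DataFrame(cleanserlist)  # one row per scraped product, columns taken from the dict keys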

[image: first 44 rows of the dataframe]

[image: subsequent 44 rows of the dataframe]



Solution 1:[1]

A "table driven" approach is highly appropriate for this kind of thing and makes for easier extensibility.

Given that there are a large number of URLs to [try to] access then a multithreaded approach is highly desirable for potentially greatly improved performance.

Here's an example of that kind of approach:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

# each entry: (field name, tag, CSS class) to pass to soup.find
CONTROL = [('price', 'span', 'sellingPrice'),
           ('name', 'h1', 'flex flex-xs-100'),
           ('ingredients', 'div', 'v-pane-content'),
           ('rating', 'div', 'ratingValue'),
           ('reviews', 'span', 'reviewCount')
           ]
AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'
TIMEOUT = 3.05
cleanserlist = []
productlinks = [] # list of URLs

headers = {'User-Agent': AGENT} # potentially more complex

def process(link):
    try:
        # fetch the page; raise_for_status() turns 4xx/5xx responses into exceptions
        (r := requests.get(link, headers=headers, timeout=TIMEOUT)).raise_for_status()
        cleanser = {}
        soup = BeautifulSoup(r.text, 'lxml')
        for v, e, c in CONTROL:
            try:
                cleanser[v] = soup.find(e, class_=c).text.strip()
            except Exception:
                cleanser[v] = f'no {v}'
        cleanserlist.append(cleanser)
    except Exception as e:
        print(f'Error processing {link} due to {e}')


def main():
    with ThreadPoolExecutor() as executor:
        executor.map(process, productlinks)
    print(cleanserlist)

if __name__ == '__main__':
    main()
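
ThreadPoolExecutor also accepts a max_workers argument, which is worth capping if the site objects to many simultaneous requests. A minimal variation of main() above (the value 8 is just an illustrative guess):

def main():
    # cap concurrency so the target site isn't hit by too many parallel requests
    with ThreadPoolExecutor(max_workers=8) as executor:
        executor.map(process, productlinks)
    print(cleanserlist)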

Solution 2:[2]

In case someone else doesn't know about HTML entities (just like me) and needs the answer:

Thanks to Amadan's comment, I just learned that the strange thing I got instead of my accented character is called an HTML entity.

To get my accent back, I needed to unescape it:

import html

print(html.unescape("Corrig&eacute;s exercices entrainement chapitre mouvement et forces"))

>> Corrigés exercices entrainement chapitre mouvement et forces
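
Tying this back to the scraping question above: BeautifulSoup decodes HTML entities while parsing, so .text already gives the real characters. A quick check using the same accented word:

from bs4 import BeautifulSoup

print(BeautifulSoup("<p>Corrig&eacute;s</p>", "lxml").p.text)

>> Corrigés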

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution | Source
Solution 1 |
Solution 2 | Bonsai Noodle