Why does BeautifulSoup.find not work when using a for loop?
I am iterating through a list of URLs to extract 5 items from each URL using BeautifulSoup.find. The list contains about 2000 URLs. Because not every webpage is guaranteed to have all 5 items, I used try and except appropriately.
After completing the loop, I noticed 3 things:
- The very first 5-10 links would run seamlessly, meaning I would successfully retrieve all 5 items (none of the except blocks were used).
- For the overwhelming majority of URLs, the try blocks failed, so the except block ran for each item.
- Every once in a while, a URL's try blocks DID succeed and I would successfully retrieve all 5 items.
I placed the results in a list of dictionaries and then created a DataFrame.
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

# headers and productlinks are defined earlier in the script
cleanserlist = []

for link in productlinks:
    try:
        r = requests.get(link, headers=headers, timeout=3.05)
    except requests.exceptions.Timeout:
        print("Timeout occurred")

    soup = BeautifulSoup(r.content, 'lxml')

    try:
        price = soup.find('span', class_="sellingPrice").text.strip()
    except:
        price = 'no price'
    try:
        name = soup.find('h1', class_='flex flex-xs-100').text.strip()
    except:
        name = 'no name'
    try:
        ingredients = soup.find('div', class_='v-pane-content').text.strip()
    except:
        ingredients = 'no ingredients'
    try:
        rating = soup.find('div', class_='ratingValue').text.strip()
    except:
        rating = 'no rating'
    try:
        reviews = soup.find('span', class_='reviewCount').text.strip()
    except:
        reviews = 'no reviews'

    cleanser = {
        'name': name,
        'price': price,
        'rating': rating,
        'reviews': reviews,
        'ingredients': ingredients
    }
    cleanserlist.append(cleanser)
    sleep(randint(1, 3))
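For reference, the "list of dictionaries, then a DataFrame" step mentioned in the question is a one-liner with pandas. This is only a minimal sketch and assumes pandas is installed and cleanserlist has been filled by the loop above:

import pandas as pd

# Each dictionary in cleanserlist becomes one row; the keys become the column names.
df = pd.DataFrame(cleanserlist)
print(df.head())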
Solution 1:[1]
A "table driven" approach is highly appropriate for this kind of thing and makes for easier extensibility.
Given that there are a large number of URLs to [try to] access then a multithreaded approach is highly desirable for potentially greatly improved performance.
Here's an example of that kind of approach:
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

CONTROL = [('price', 'span', 'sellingPrice'),
           ('name', 'h1', 'flex flex-xs-100'),
           ('ingredients', 'div', 'v-pane-content'),
           ('rating', 'div', 'ratingValue'),
           ('reviews', 'span', 'reviewCount')
           ]

AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'
TIMEOUT = 3.05

cleanserlist = []
productlinks = []  # list of URLs
headers = {'User-Agent': AGENT}  # potentially more complex

def process(link):
    try:
        (r := requests.get(link, headers=headers, timeout=TIMEOUT)).raise_for_status()
        cleanser = {}
        soup = BeautifulSoup(r.text, 'lxml')
        for v, e, c in CONTROL:
            try:
                cleanser[v] = soup.find(e, class_=c).text.strip()
            except Exception:
                cleanser[v] = f'no {v}'
        cleanserlist.append(cleanser)
    except Exception as e:
        print(f'Error processing {link} due to {e}')

def main():
    with ThreadPoolExecutor() as executor:
        executor.map(process, productlinks)
    print(cleanserlist)

if __name__ == '__main__':
    main()
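One design note on the snippet above: appending to the shared cleanserlist from several threads works in CPython, where list.append is generally considered thread-safe under the GIL, but a common alternative is to have process return its dictionary and collect the results from executor.map. This is only a sketch of that variant, reusing CONTROL, headers, TIMEOUT and productlinks from above:

def process(link):
    # Fetch one page and return its extracted fields, or None on failure.
    try:
        (r := requests.get(link, headers=headers, timeout=TIMEOUT)).raise_for_status()
    except Exception as exc:
        print(f'Error processing {link} due to {exc}')
        return None
    soup = BeautifulSoup(r.text, 'lxml')
    cleanser = {}
    for v, e, c in CONTROL:
        found = soup.find(e, class_=c)
        cleanser[v] = found.text.strip() if found else f'no {v}'
    return cleanser

def main():
    with ThreadPoolExecutor() as executor:
        # executor.map yields results in input order; drop the failed fetches
        cleanserlist = [c for c in executor.map(process, productlinks) if c is not None]
    print(cleanserlist)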
Solution 2:[2]
In case someone doesn't know about HTML entities (just like me) and needs the answer: thanks to Amadan's comment, I just learned that the strange thing I got instead of my accent is called an HTML entity.
In order to get my accent back, I needed to unescape it:
import html
print(html.unescape("Corrig&eacute;s exercices entrainement chapitre mouvement et forces"))
>> Corrigés exercices entrainement chapitre mouvement et forces
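As a side note that is not part of the original answer: html.unescape handles named, decimal and hexadecimal character references alike, so any of the usual encodings of the accent decodes the same way:

import html

# Named, decimal and hex references for "é" all decode to the same character.
print(html.unescape("&eacute; &#233; &#xe9;"))
>> é é é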
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Bonsai Noodle |
