How to scrape specific information on a website
Here's my script:
import re
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

URLs = ['https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html']

Marques = []
Brands = []
Refs = []
Prices = []
#Carts = []
#Links = []
References = []
Links = []

for url in URLs:
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")

    Marques.append('IWC')

    Brand = soup.find('span', class_ = 'iwc-buying-options-title').text
    Brand = str(Brand)
    Brand = re.sub("Ajouter à la liste de souhaits", '', Brand)
    Brand = re.sub("\n", '', Brand)
    Brands.append(Brand)

    # The list is named Prices; appending to an undefined Price raises a NameError
    Prices.append(soup.find('div', class_ = 'iwc-buying-options-price').get_text(strip=True))
    Links.append(url)
    References.append(soup.find('h1', class_ = 'iwc-buying-options-reference').text)

print(Brands)
print(Prices)
print(Links)
print(References)
Unfortunately, the brand comes back as: [" Grande Montre d'Aviateur\xa043 "]
The reference comes back as: ['\n IW329303\n ']
And the price comes back empty. I think it's because the element contains no text, as you can see:
print(soup.find('div', class_ = 'iwc-buying-options-price'))
<div class="iwc-buying-options-price"></div>
Any ideas on how to do that?
I would like this output:
Solution 1:[1]
You'll want to use .strip() to get rid of that whitespace. For example:
Brand = soup.find('span', class_ = 'iwc-buying-options-title').text.strip()
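Note that .strip() only trims the two ends of a string, so the non-breaking space (\xa0) inside the brand name stays in place. If that matters, a minimal sketch like the one below collapses every run of whitespace into a single space. The helper name clean_text is hypothetical (it is not part of BeautifulSoup), and the two input strings are the raw values reported in the question.

```python
def clean_text(raw: str) -> str:
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes newlines and the non-breaking space U+00A0.
    return " ".join(raw.split())


# Raw values copied from the question's output
brand = clean_text(" Grande Montre d'Aviateur\xa043 ")
reference = clean_text("\n IW329303\n ")

print(brand)
print(reference)
```

The same helper could be applied to the result of .text or .get_text() for any tag.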
The price, unfortunately, is not as easy. The page is dynamic, meaning the HTML tag does not contain the price in the static response. The data is available, though, as JSON from a companion URL:
import requests
from bs4 import BeautifulSoup
import pandas as pd

rows = []
URLs = ['https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html',
        'https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw392202-pilot_s-watch-perpetual-calendar-chronograph-edition-le.html']

for url in URLs:
    # The product data is served as JSON next to the HTML page
    productUrl = url.replace('.html', '.productinfo.FR.json')
    jsonData = requests.get(productUrl).json()

    # The JSON has a single top-level key: the product id
    productId = list(jsonData.keys())[0]
    data = list(jsonData.values())[0]

    # The product name still comes from the static HTML
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")
    brand = soup.find('span', class_ = 'iwc-buying-options-title').text.strip().split('\n')[0]

    price = data['price']
    addToCart = data['stock']

    row = {
        'product': brand,
        'productId': productId,
        'price': price,
        'add_to_cart': addToCart,
        'link': url}
    rows.append(row)

df = pd.DataFrame(rows)
# Keep only the products that can be added to the cart
inStock_df = df[df['add_to_cart'] == True]
Output:
print(df.to_string())
product productId price add_to_cart link
0 Grande Montre d'Aviateur 43 IWIW329303 9100.00 True https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html
1 Montre d'Aviateur Calendrier Perpétuel Chronographe Édition «Le Petit Prince» IWIW392202 41100.00 False https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw392202-pilot_s-watch-perpetual-calendar-chronograph-edition-le.html
And to get only the items in stock:
print(inStock_df.to_string())
product productId price add_to_cart link
0 Grande Montre d'Aviateur 43 IWIW329303 9100.00 True https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html
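To make the JSON step easier to test without hitting the site, here is a small offline sketch of the same idea using only the standard library. The function names product_info_url and parse_product_info are hypothetical, and SAMPLE is an assumed payload whose shape is inferred from the code above (one top-level key holding 'price' and 'stock'); the real response may contain more fields or use different value types.

```python
import json


def product_info_url(page_url: str, locale: str = "FR") -> str:
    # Swap the trailing ".html" for the ".productinfo.<locale>.json" suffix
    if not page_url.endswith(".html"):
        raise ValueError("expected a product page URL ending in .html")
    return page_url[: -len(".html")] + f".productinfo.{locale}.json"


def parse_product_info(payload: dict) -> dict:
    # The payload is assumed to have one top-level key: the product id
    if not payload:
        raise ValueError("empty product info payload")
    product_id, data = next(iter(payload.items()))
    return {
        "productId": product_id,
        "price": data.get("price"),       # None if the field is missing
        "add_to_cart": data.get("stock"),
    }


# Assumed sample payload, not a captured response
SAMPLE = json.loads('{"IWIW329303": {"price": "9100.00", "stock": true}}')

url = product_info_url(
    "https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html"
)
row = parse_product_info(SAMPLE)

print(url)
print(row)
```

In a real run, the payload would come from requests.get(url).json() instead of SAMPLE. Using .get() rather than indexing means a missing field yields None instead of raising a KeyError.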
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
