How to scrape specific information on a website

Here's my script:

import re
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

URLs = ['https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html']

Marques = []
Brands = []
Refs = []
Prices = []
#Carts = []
#Links = []
References = []
Links = []

for url in URLs:

    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")


    Marques.append('IWC')

    Brand = soup.find('span', class_ = 'iwc-buying-options-title').text
    Brand = str(Brand)
    Brand = re.sub("Ajouter à la liste de souhaits", '', Brand)

    Brand = re.sub("\n", '', Brand)
    Brands.append(Brand)

    Prices.append(soup.find('div', class_ = 'iwc-buying-options-price').get_text(strip=True))

    Links.append(url)

    References.append(soup.find('h1', class_ = 'iwc-buying-options-reference').text)

print(Brands)
print(Prices)
print(Links)
print(References)

Unfortunately, Brands gives me this: [" Grande Montre d'Aviateur\xa043 "]

References gives me this: ['\n IW329303\n ']

And Prices gives me nothing. I think that's because the element contains no text, as you can see:

print(soup.find('div', class_ = 'iwc-buying-options-price'))
<div class="iwc-buying-options-price"></div>

Any ideas on how to do that?

I would like this output:

[image: desired output]



Solution 1:[1]

You'll want to use .strip() to get rid of that whitespace.

For example: Brand = soup.find('span', class_ = 'iwc-buying-options-title').text.strip()
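Note that the brand string in the question also has a non-breaking space (\xa0) in the middle, which .strip() leaves alone because it only trims the ends. Here is a minimal sketch of one way to normalize both strings, using only built-in string methods. The helper name clean_text is hypothetical and not part of the original script:

```python
def clean_text(raw: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which covers newlines and the non-breaking space as well.
    return " ".join(raw.split())


# The two raw values reported in the question
brand = clean_text(" Grande Montre d'Aviateur\xa043 ")
reference = clean_text("\n IW329303\n ")

print(brand)
print(reference)
```

The same helper could be applied to the .text of any tag before appending it to a list.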

Price is unfortunately not as easy. The page is dynamic, meaning that the HTML tag does not contain the price in the static response. The price is available as JSON, though, from a separate .productinfo.FR.json URL:

import requests
from bs4 import BeautifulSoup
import pandas as pd

rows = []
URLs = ['https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html',
        'https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw392202-pilot_s-watch-perpetual-calendar-chronograph-edition-le.html']
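for url in URLs:
    # The product data is served as JSON at a companion URL
    productUrl = url.replace('.html', '.productinfo.FR.json')
    jsonData = requests.get(productUrl).json()

    # The first top-level key is the product ID; its value holds the product details
    productId = list(jsonData.keys())[0]
    data = list(jsonData.values())[0]

    # The product name is still scraped from the static HTML
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")

    # Keep only the first line of the title text
    brand = soup.find('span', class_ = 'iwc-buying-options-title').text.strip().split('\n')[0]
    price = data['price']
    addToCart = data['stock']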
    
    row = {
        'product':brand,
        'productId':productId,
        'price':price,
        'add_to_cart':addToCart,
        'link':url}
    
    rows.append(row)
    
df = pd.DataFrame(rows)
inStock_df = df[df['add_to_cart'] == True]

Output:

print(df.to_string())
                                                                         product   productId     price  add_to_cart                                                                                                                             link
0                                                    Grande Montre d'Aviateur 43  IWIW329303   9100.00         True                                      https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html
1  Montre d'Aviateur Calendrier Perpétuel Chronographe Édition «Le Petit Prince»  IWIW392202  41100.00        False  https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw392202-pilot_s-watch-perpetual-calendar-chronograph-edition-le.html

And to get just the items in stock:

print(inStock_df.to_string())
                       product   productId    price  add_to_cart                                                                                         link
0  Grande Montre d'Aviateur 43  IWIW329303  9100.00         True  https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html
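The URL rewriting and JSON unpacking above could also be pulled into two small functions, which makes them easier to check without hitting the site. This is only a sketch: the function names product_info_url and parse_product_info are hypothetical, and the sample payload is an assumed shape inferred from how the loop reads the response (one top-level key holding 'price' and 'stock'), not a real response from the site.

```python
from typing import Any, Dict


def product_info_url(page_url: str, locale: str = "FR") -> str:
    """Turn a product page URL into its .productinfo.<locale>.json URL."""
    suffix = ".html"
    if not page_url.endswith(suffix):
        raise ValueError(f"expected a URL ending in .html, got: {page_url}")
    # Replace only the trailing suffix, not every '.html' in the string
    return page_url[: -len(suffix)] + f".productinfo.{locale}.json"


def parse_product_info(json_data: Dict[str, Any]) -> Dict[str, Any]:
    """Extract product ID, price and stock flag from the assumed payload shape."""
    if not json_data:
        raise ValueError("empty product info payload")
    # Assumption: the first key is the product ID, its value the details
    product_id, data = next(iter(json_data.items()))
    return {
        "productId": product_id,
        "price": data.get("price"),       # None if the key is missing
        "add_to_cart": data.get("stock"),  # None if the key is missing
    }


url = "https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html"
json_url = product_info_url(url)

# Hypothetical payload, for illustration only
sample = {"IWIW329303": {"price": "9100.00", "stock": True}}
row = parse_product_info(sample)

print(json_url)
print(row)
```

Using .get() instead of direct indexing means a product without a price or stock entry produces None in the row rather than a KeyError, which may or may not be what you want.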

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1