BeautifulSoup Amazon Product Detail
I can't scrape the HTML of the "Product Details" section (scroll down the page to find it) using requests or requests_html. find_all returns an empty result. Any help?
from bs4 import BeautifulSoup
from requests import session
from requests_html import HTMLSession
s = HTMLSession()
#s = session()
r = s.get("https://www.amazon.com/dp/B094HWN66Y")
soup = BeautifulSoup(r.text, 'html.parser')
len(soup.find_all("div", {"id":"detailBulletsWrapper_feature_div"}))
Solution 1:[1]
This is an example of how to scrape the title of the product using bs4 and requests, easily expandable to getting other info from the product.
The reason yours doesn't work is that your request doesn't send browser-like headers (such as a User-Agent), so Amazon realises you're a bot and doesn't want you scraping their site. This is shown by your request being returned as <Response [503]>, and the reason is explained in r.text.
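A quick way to confirm this before parsing is to look at the status code. The snippet below is only a minimal sketch: `describe_response` is a hypothetical helper (not part of requests), and `SimpleNamespace` is used as an offline stand-in for the object that `s.get(...)` or `requests.get(...)` returns.

```python
from types import SimpleNamespace


def describe_response(response) -> str:
    """Summarise whether a response is worth parsing (hypothetical helper)."""
    # Anything other than 200 means the page we wanted was not served.
    if response.status_code != 200:
        return (f"Request failed with status {response.status_code}; "
                "inspect response.text for the reason")
    return "Request succeeded; response.text can be parsed"


# Offline stand-ins; with requests you would pass the real response object.
blocked = SimpleNamespace(status_code=503, text="")
served = SimpleNamespace(status_code=200, text="<html></html>")

print(describe_response(blocked))
print(describe_response(served))
```

If the first message appears for your real request, find_all has nothing useful to search, which would explain the empty result.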
I believe Amazon have an API for this (that they'd probably like you to use) but it'll be fine to scrape like this for small-scale stuff.
import requests
import bs4

# Amazon doesn't like being scraped; however, these headers should stop it
# from noticing a small number of requests
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)AppleWebKit/537.36 (KHTML, like Gecko)Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
}


def main():
    url = "https://www.amazon.com/dp/B094HWN66Y"
    title = get_title(url)
    print("The title of %s is: %s" % (url, title))


def get_title(url: str) -> str:
    """Returns the title of the amazon product."""
    # The request, sent with the browser-like headers defined above
    r = requests.get(url, headers=HEADERS)
    # Parse the content
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    # The product title is the span with id "productTitle"
    title = soup.find("span", attrs={"id": 'productTitle'}).string
    return title


if __name__ == "__main__":
    main()
Output:
The title of https://www.amazon.com/dp/B094HWN66Y is: Will They, Won't They?
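The same approach might be extended to the "Product Details" block the question asks about. This is only a sketch under assumptions: the div id comes from the question, but `SAMPLE_HTML` is hypothetical, simplified markup (Amazon's real HTML is more complex and changes over time), and `get_product_details` is an illustrative helper name.

```python
import bs4

# Hypothetical, simplified markup standing in for the real page.
SAMPLE_HTML = """
<div id="detailBulletsWrapper_feature_div">
  <ul>
    <li><span><span class="a-text-bold">Publisher :</span> <span>Example Press</span></span></li>
    <li><span><span class="a-text-bold">Language :</span> <span>English</span></span></li>
  </ul>
</div>
"""


def get_product_details(html: str) -> dict:
    """Return label/value pairs from the details block, or {} if it is absent."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", attrs={"id": "detailBulletsWrapper_feature_div"})
    if wrapper is None:
        # Block not present, e.g. because the request was rejected
        return {}
    details = {}
    for item in wrapper.find_all("li"):
        label = item.find("span", class_="a-text-bold")
        if label is None:
            continue
        # Assumes the value sits in the span right after the bold label
        value = label.find_next_sibling("span")
        if value is None:
            continue
        key = label.get_text(strip=True).rstrip(":").strip()
        details[key] = value.get_text(strip=True)
    return details


print(get_product_details(SAMPLE_HTML))
```

For a live page you would pass `r.text` from a request sent with the HEADERS above instead of `SAMPLE_HTML`; the selectors will likely need adjusting to whatever markup Amazon actually returns.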
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
