Scraping text in meta tag with Selenium
I'm trying to get the book description from the following webpage: https://bookshop.org/books/lucky-9798200961177/9781668002452
This is what I've got so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome('path_to_my_driver_on_local', options=options)
driver.get('https://bookshop.org/a/16709/9781668002452')
description = driver.find_element_by_xpath("//meta[@name='description']").get_attribute("content")
description
Basically, I'm trying to get the text inside of this html:
<meta name="description" content="REESE'S BOOK CLUB PICK NEW YORK TIMES BESTSELLER A thrilling roller-coaster ride about a heist gone terribly wrong, with a plucky protagonist who will win readers' hearts. What if you had the winning ticket ....">
I end up with the following error:
Message: no such element: Unable to locate element: {"method":"xpath","selector":"//meta[@name='description']"}
Solution 1:[1]
Locate the <meta> element by its name attribute, then read its content attribute:
from selenium.webdriver.common.by import By

elem = driver.find_element(By.XPATH, "//meta[@name='description']")
print(elem.get_attribute("content"))
Solution 2:[2]
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome('path_to_my_driver_on_local', options=options)
driver.get('https://bookshop.org/books/lucky-9798200961177/9781668002452')
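# Hand the rendered page source to BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
# Read the description from the page body (div.title-description) instead of the meta tag
descriptions = soup.find_all('div', class_="title-description")
print(descriptions[0].text)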
Solution 3:[3]
You need to target the element with the correct xpath. Your value for the xpath //meta[@content] is returning the first meta element that contains a content attribute.
I would recommend using the XPath //meta[@name="description"] or the CSS selector meta[name="description"] for a more precise selection. For example:
# imports and boilerplate
....
from selenium.webdriver.common.by import By

# find_element_by_css_selector was removed in Selenium 4.3; use find_element with By instead
description_meta_element = driver.find_element(By.CSS_SELECTOR, 'meta[name="description"]')
description_meta_content = description_meta_element.get_attribute('content')
print(description_meta_content)
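To see the difference between the two expressions without launching a browser, here is a small offline sketch. It uses the standard library's xml.etree.ElementTree, which supports this subset of XPath, rather than Selenium. The sample markup is made up for illustration and is not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified <head>; tags are self-closed so it parses as XML.
SAMPLE_HEAD = """
<head>
  <meta name="viewport" content="width=device-width" />
  <meta name="description" content="A thrilling roller-coaster ride" />
</head>
""".strip()

head = ET.fromstring(SAMPLE_HEAD)

# Matches the first <meta> that has any content attribute at all.
first_with_content = head.find(".//meta[@content]")

# Matches only the <meta> whose name attribute is "description".
description_meta = head.find(".//meta[@name='description']")

print(first_with_content.get("content"))  # width=device-width
print(description_meta.get("content"))    # A thrilling roller-coaster ride
```

The first expression returns the viewport tag simply because it comes first in the document, which is why filtering on the name attribute is the safer choice.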
Solution 4:[4]
This <meta> tag...
<meta name="description" content="REESE'S BOOK CLUB PICK NEW YORK TIMES BESTSELLER A thrilling roller-coaster ride about a heist gone terribly wrong, with a plucky protagonist who will win readers' hearts. What if you had the winning ticket ....">
...is within the <head> section, so it is never rendered on the page. You don't need a browser to read it, provided the tag is present in the HTML the server returns.
Solution
In this case your best bet would be to use BeautifulSoup with urllib.request as follows:
from bs4 import BeautifulSoup
from urllib.request import urlopen # In python3, urllib2 has been split into urllib.request and urllib.error
webpage = urlopen('https://bookshop.org/books/lucky-9798200961177/9781668002452').read()
soup = BeautifulSoup(webpage, "lxml")
my_meta = soup.find("meta",{"name":"description"})
print(my_meta["content"])
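If BeautifulSoup and lxml are not installed, the same lookup can be sketched with only the standard library's html.parser. This is a minimal illustration, not a replacement for the code above: MetaDescriptionParser and extract_description are hypothetical names, and the sample HTML is a shortened stand-in based on the tag quoted in the question.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collects the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attr_map = dict(attrs)
        is_description = tag == "meta" and attr_map.get("name") == "description"
        if is_description and self.description is None:
            self.description = attr_map.get("content")


def extract_description(html):
    """Return the meta description of an HTML string, or None if absent."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description


# Shortened stand-in for the real page source.
SAMPLE_HTML = """<html><head>
<meta charset="utf-8">
<meta name="description" content="REESE'S BOOK CLUB PICK">
</head><body></body></html>"""

print(extract_description(SAMPLE_HTML))  # REESE'S BOOK CLUB PICK
```

In practice the string passed to extract_description would be the decoded response body, for example the result of urlopen(...).read().decode().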
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Arundeep Chohan |
| Solution 2 | ma9 |
| Solution 3 | Dharman |
| Solution 4 | undetected Selenium |
