XPath getting a specific set of elements within a class
I am scraping Google Scholar and am having trouble getting the right XPath expression. When I inspect the elements I want, the browser gives me expressions like these:
//*[@id="gs_res_ccl_mid"]/div[2]/div[2]/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[3]/div/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[6]/div[2]/div[3]/a[3]
I ended up with the generic expression:
//*[@id="gs_res_ccl_mid"]//a[3]
I also tried this alternative, with similar results:
//*[@id="gs_res_ccl_mid"]/div*/div*/div*/a[3]
The output looks something like this (I cannot post the entire result set because I don't have 10 reputation points):
[
'https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/citations?user=nd8O1XQAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=7733120668292842687&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=15761030700327980189&as_sdt=2005&sciodt=0,5&hl=es'
]
The problem with the output is that it contains 3 extra, unwanted elements, and they all contain the text citations?user. How can I get rid of the unwanted elements?
My code:
def paperOthers(exp,atr=None):
    thread = browser.find_elements(By.XPATH,(" %s" % exp))
    xArray = []
    for t in thread:
        if atr == 0:
            xThread = t.get_attribute('id')
        elif atr == 1:
            xThread = t.get_attribute('href')
        else:
            xThread = t.text
        xArray.append(xThread)
    return xArray
Which I call with:
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3]", 1)
Solution 1:[1]
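Change the XPath to exclude the links whose href contains citations?user. The predicate tests @href because . would only be compared against the link's visible text.
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3][not(contains(@href,'citations?user'))]",1)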
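If changing the XPath is inconvenient, the same cleanup can be done on the Python side after the hrefs have been collected. The following is a minimal sketch, not part of the original answer: the helper name drop_author_links and its marker parameter are hypothetical, and the sample URLs are taken from the output shown in the question.

```python
def drop_author_links(hrefs, marker="citations?user"):
    """Return only the hrefs that do not contain the marker substring.

    Empty or None values are skipped as well, since an <a> element
    may have no href attribute at all.
    """
    return [h for h in hrefs if h and marker not in h]


# Sample values copied from the question's output.
hrefs = [
    "https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es",
    "https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra",
    "https://scholar.google.es/citations?user=nd8O1XQAAAAJ&hl=es&oi=sra",
    "https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es",
]

cites = drop_author_links(hrefs)
print(cites)
```

Filtering in the XPath itself avoids fetching the unwanted elements in the first place, so this is better seen as a fallback than as a replacement.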
Solution 2:[2]
The XPath expression could be as simple as //*[@class="gs_fl"]/a[3]/@href:
- //* selects every element in the document, which the predicate that follows then narrows down.
- [@class="gs_fl"] keeps only the elements whose class attribute is gs_fl.
- /a[3] selects the third <a> element that is a child of the gs_fl element.
- /@href selects the href attribute of that <a> element.
A w3schools XPath syntax reminder.
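To see why anchoring on the class matters, the sketch below compares the broad //a[3] pattern with the class-anchored one. The markup is a hypothetical, heavily simplified stand-in for a results page (the real page structure is an assumption here), and it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the href is read with .get() instead of /@href.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each result has an author line
# and a footer line of links. Not the real page structure.
SAMPLE = """
<div id="gs_res_ccl_mid">
  <div class="gs_ri">
    <div class="gs_a">
      <a href="/citations?user=AAA">A. Author</a>
      <a href="/citations?user=BBB">B. Author</a>
      <a href="/citations?user=CCC">C. Author</a>
    </div>
    <div class="gs_fl">
      <a href="#save">Save</a>
      <a href="#cite">Cite</a>
      <a href="/scholar?cites=111">Cited by 10</a>
    </div>
  </div>
  <div class="gs_ri">
    <div class="gs_a">
      <a href="/citations?user=DDD">D. Author</a>
    </div>
    <div class="gs_fl">
      <a href="#save">Save</a>
      <a href="#cite">Cite</a>
      <a href="/scholar?cites=222">Cited by 5</a>
    </div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Broad: the third <a> child of ANY element under the container,
# which also catches the third author link in the first result.
broad = [a.get("href") for a in root.findall(".//a[3]")]

# Narrow: the third <a> child of elements whose class is exactly "gs_fl".
narrow = [a.get("href") for a in root.findall(".//*[@class='gs_fl']/a[3]")]

print(broad)   # ['/citations?user=CCC', '/scholar?cites=111', '/scholar?cites=222']
print(narrow)  # ['/scholar?cites=111', '/scholar?cites=222']
```

In this sample the broad pattern picks up an author link whenever a result lists three or more linked authors, which matches the symptom described in the question.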
Code and full example in the online IDE:
from parsel import Selector
import requests
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "biology",  # search query
    "hl": "en"       # language
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
# used to act as a "real" user visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
for cite_by in selector.xpath('//*[@class="gs_fl"]/a[3]/@href'):
    # the href is site-relative and already starts with "/",
    # so no extra slash is added after the domain
    cited_by_link = f"https://scholar.google.com{cite_by.get()}"
    print(cited_by_link)

# output:
"""
https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""
Alternatively, you can achieve the same thing by using the Google Scholar Organic Results API from SerpApi.
It's a paid API with a free plan. It saves you from having to figure out how to scrape the data and maintain the scraper over time, how to scale it without getting blocked by the search engine, and how to find reliable proxy providers or CAPTCHA-solving services.
Example code to integrate:
from serpapi import GoogleScholarSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi API key
    "engine": "google_scholar",       # scraping search engine
    "q": "biology",                   # search query
    "hl": "en"                        # language
}
search = GoogleScholarSearch(params)
results = search.get_dict()
for cited_by in results["organic_results"]:
    cited_by_link = cited_by["inline_links"]["cited_by"]["link"]
    print(cited_by_link)
# output:
"""
https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""
Disclaimer: I work for SerpApi.
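The SerpApi loop above indexes directly into inline_links, cited_by, and link, which raises a KeyError if any of those keys is missing from a result. Whether a result can actually lack a cited_by entry is an assumption here, not something the answer states. A defensive sketch, using a hypothetical helper name and a hand-written sample dictionary in place of a real API response:

```python
def cited_by_links(results):
    """Collect cited-by links, skipping results that lack one."""
    links = []
    for item in results.get("organic_results", []):
        # Each .get() falls back to an empty dict so a missing key
        # yields None instead of raising KeyError.
        link = item.get("inline_links", {}).get("cited_by", {}).get("link")
        if link:
            links.append(link)
    return links


# Hypothetical response fragment, shaped like the keys used above.
sample = {
    "organic_results": [
        {"inline_links": {"cited_by": {"link": "https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en"}}},
        {"inline_links": {}},  # assumed case: a result without a cited-by entry
    ]
}

print(cited_by_links(sample))
```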
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dmitriy Zub |
| Solution 2 | |
