XPath: getting a specific set of elements within a class

I am scraping Google Scholar and am having trouble getting the right XPath expression. When I inspect the elements I want, the browser gives me expressions like these:

//*[@id="gs_res_ccl_mid"]/div[2]/div[2]/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[3]/div/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[6]/div[2]/div[3]/a[3]

I ended up with the generic expression:

//*[@id="gs_res_ccl_mid"]//a[3]

I also tried this alternative, with similar results:

//*[@id="gs_res_ccl_mid"]/div*/div*/div*/a[3]

The output is something like this (I cannot post the entire result set because I don't have 10 reputation points):

[
'https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/citations?user=nd8O1XQAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=7733120668292842687&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=15761030700327980189&as_sdt=2005&sciodt=0,5&hl=es'
]

The problem with the output is that there are three unwanted extra elements, and they all contain the text citations?user. What can I do to get rid of the unwanted elements?

My code:

from selenium.webdriver.common.by import By

def paperOthers(exp, atr=None):
    # browser is assumed to be an already-initialized Selenium WebDriver
    thread = browser.find_elements(By.XPATH, exp)
    xArray = []
    for t in thread:
        if atr == 0:
            xThread = t.get_attribute('id')
        elif atr == 1:
            xThread = t.get_attribute('href')
        else:
            xThread = t.text
        xArray.append(xThread)
    return xArray

Which I call with:

rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3]", 1)


Solution 1:[1]

Change the XPath to exclude the items with text.

rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3][not(contains(.,'citations?user'))]",1)
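If you would rather keep the broad XPath and filter afterwards, the same cleanup can be done in plain Python. A minimal sketch, using sample URLs from the question:

```python
# Sample links as returned by the broad //a[3] XPath (taken from the question)
links = [
    "https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es",
    "https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra",
    "https://scholar.google.es/citations?user=nd8O1XQAAAAJ&hl=es&oi=sra",
    "https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es",
]

# Keep only the "Cited by" links; drop the author-profile links
rcites = [link for link in links if "citations?user" not in link]
print(rcites)
```

Filtering in the XPath is cleaner when you control the query, but a Python-side filter is easier to debug because you can inspect the raw list first.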

Solution 2:[2]

The XPath expression could be as simple as //*[@class="gs_fl"]/a[3]/@href:

  • //* selects every element in the document, regardless of its tag name.
  • [@class="gs_fl"] narrows that to elements whose class attribute is gs_fl.
  • /a[3] selects the third <a> element that is the child of the gs_fl class element.
  • /@href selects href attribute of an <a> element.

See the w3schools XPath syntax page for a quick reminder.
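Those steps can be checked on a tiny self-contained snippet. The markup below is a simplified, hypothetical stand-in for one Scholar result footer; Python's standard-library ElementTree understands this XPath subset (apart from the trailing /@href, which is read off the matched element instead):

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one Google Scholar result footer
html = """
<div>
  <div class="gs_fl">
    <a href="/citations?user=X">author profile</a>
    <a href="/related">Related articles</a>
    <a href="/scholar?cites=123">Cited by 10</a>
  </div>
</div>
"""

root = ET.fromstring(html)
# //*[@class="gs_fl"]/a[3]: the third <a> child of the gs_fl element
third_link = root.find(".//*[@class='gs_fl']/a[3]")
print(third_link.attrib["href"])  # /scholar?cites=123
```

On the real page, the result depends on each footer actually having the "Cited by" link in third position, which is why the not(contains(...)) filter from Solution 1 is the more robust guard.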


Code and full example in the online IDE:

from parsel import Selector
import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "biology",  # search query
    "hl": "en"       # language
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
# used to act as a "real" user visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

for cite_by in selector.xpath('//*[@class="gs_fl"]/a[3]/@href'):
    # the extracted href is root-relative (starts with "/"), so don't add another slash
    cited_by_link = f"https://scholar.google.com{cite_by.get()}"
    print(cited_by_link)

# output:
"""
https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""

Alternatively, you can achieve the same thing by using the Google Scholar Organic Results API from SerpApi.

It's a paid API with a free plan. It spares you from figuring out how to scrape the data and maintain the scraper over time, how to scale it without getting blocked by the search engine, and how to find reliable proxy providers or CAPTCHA-solving services.

Example code to integrate:

from serpapi import GoogleScholarSearch
import os

params = {
    "api_key": os.getenv("API_KEY"), # SerpApi API key
    "engine": "google_scholar",      # scraping search engine
    "q": "biology",                  # search query
    "hl": "en"                       # langugage
}

search = GoogleScholarSearch(params)
results = search.get_dict()

for cited_by in results["organic_results"]:
    cited_by_link = cited_by["inline_links"]["cited_by"]["link"]
    print(cited_by_link)

# output:
"""
https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""

Disclaimer: I work for SerpApi.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Dmitriy Zub
Solution 2: (unattributed)