Web scraping from Google Scholar with BeautifulSoup

Searching Google Scholar for the string "electronic knee" returns about 14,000 results. Here is the link: https://scholar.google.com/scholar?start=10&q=electronic+knee&hl=it&as_sdt=8,5&as_ylo=2017&as_rr=1

Is it possible to obtain the total number of results (shown at the top of the page) through web scraping in Python? I'm using the bs4 library (the find_all function) to retrieve strings from each record, but I would also like to get the total number of results. Which tag holds it, or is there another method?



Solution 1:[1]

This value is inside the div with id="gs_ab_md" (at least on my computer):

import requests
from bs4 import BeautifulSoup as BS

url = 'https://scholar.google.com/scholar?start=10&q=electronic+knee&hl=it&as_sdt=8,5&as_ylo=2017&as_rr=1'

r = requests.get(url)
soup = BS(r.text, 'html.parser')

item = soup.find('div', {'id': 'gs_ab_md'})
print(item.text)

Result:

Pagina 2 di circa 14.400 risultati (0,02 sec)
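
Note that soup.find returns None when no matching element exists, so item.text raises an AttributeError if the page served to the script lacks this div. A minimal sketch of a guard, assuming the same id as above; the helper name result_stats_text and the sample HTML are hypothetical and only illustrate the idea:

```python
from bs4 import BeautifulSoup

def result_stats_text(html):
    """Return the text of the div with id="gs_ab_md", or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    item = soup.find("div", id="gs_ab_md")
    if item is None:
        # The div was not found, e.g. because a different page was served.
        return None
    return item.get_text().strip()

# Hypothetical, simplified HTML used only to exercise the helper.
sample = '<div id="gs_ab_md">Pagina 2 di circa 14.400 risultati (0,02 sec)</div>'
print(result_stats_text(sample))
print(result_stats_text("<html><body>no stats here</body></html>"))
```

The caller can then check for None before applying any string processing.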

You can then use string functions to extract just 14.400.

For example:

parts = item.text.split(' ')
print(parts[4])

Result:

14.400

Solution 2:[2]

As an alternative to furas's answer, the same thing can be achieved with the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you get the right number of results even when the language is changed or no page number is displayed. For example, item.text.split(' ')[4] from furas's answer stops working when the page number is not displayed and raises an IndexError:

text = "Pagina 2 di circa 14.400 risultati (0,02 sec)".split(" ")[4]
print(text)
# 14.400

# Without the "Pagina 2 di circa" prefix the list has only four items:
broken_text = "14.400 risultati (0,02 sec)".split(" ")[4]
# IndexError: list index out of range

# A regular expression can be used to avoid this failure.
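
One way to apply such a regular expression is sketched below. It is not taken from either answer: the helper name extract_total_results is hypothetical, and it assumes that the timing appears in parentheses and that the total is the last number before it, as in the strings shown above.

```python
import re

def extract_total_results(stats_text):
    """Return the total result count as an int, or None if no number is found."""
    # Drop the parenthesised timing part, e.g. "(0,02 sec)".
    without_timing = re.sub(r"\([^)]*\)", "", stats_text)
    # Match integers with optional "." or "," thousands separators.
    numbers = re.findall(r"\d+(?:[.,]\d{3})*", without_timing)
    if not numbers:
        return None
    # Assumption: a page number, if present, comes before the total.
    return int(re.sub(r"\D", "", numbers[-1]))

print(extract_total_results("Pagina 2 di circa 14.400 risultati (0,02 sec)"))  # 14400
print(extract_total_results("14.400 risultati (0,02 sec)"))                    # 14400
```

Because the separators are stripped before conversion, the same function handles both "14.400" and "14,400".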

Code to integrate:

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",   # SerpApi API key
  "engine": "google_scholar",  # Google Scholar Organic results engine
  "q": "electronic knee",      # search query
  "hl": "it",                  # language
  "as_ylo": "2017"             # from year
}

search = GoogleSearch(params)
results = search.get_dict()

print(results["search_information"]["total_results"])  # total number of results

# 15700
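
If the response might lack these keys, a defensive lookup avoids a KeyError. The sketch below works on a plain dict shaped like the response above; the helper name total_results_from and both sample payloads are hypothetical, not part of the SerpApi library:

```python
def total_results_from(results):
    """Return the total result count from a response dict, or None if absent."""
    info = results.get("search_information") or {}
    return info.get("total_results")

# Hypothetical payloads used only to exercise the helper.
sample = {"search_information": {"total_results": 15700}}
print(total_results_from(sample))
print(total_results_from({"error": "hypothetical error payload"}))
```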

Disclaimer: I work for SerpApi.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 furas
Solution 2 Dmitriy Zub