Web scraping from Google Scholar with BeautifulSoup
Searching Google Scholar for the string "electronic knee" returns about 14,000 results. Here is the link: https://scholar.google.com/scholar?start=10&q=electronic+knee&hl=it&as_sdt=8,5&as_ylo=2017&as_rr=1
Is it possible to obtain the number of results (shown at the top of the page) through web scraping in Python? I'm using the bs4 library (the find_all function, to get strings) to retrieve results from each record, but I would like to get the total number of results. Which tag holds it, or is there another method?
Solution 1:[1]
This value is inside the div with id="gs_ab_md" (at least on my computer):
import requests
from bs4 import BeautifulSoup as BS
url = 'https://scholar.google.com/scholar?start=10&q=electronic+knee&hl=it&as_sdt=8,5&as_ylo=2017&as_rr=1'
r = requests.get(url)
soup = BS(r.text, 'html.parser')
item = soup.find('div', {'id': 'gs_ab_md'})
print(item.text)
Result:
Pagina 2 di circa 14.400 risultati (0,02 sec)
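Since the answer notes that this div is where the value sits "at least on my computer", it may be worth guarding against the case where the div is missing, for example if Google Scholar serves a different page to an automated request. The following is only a minimal sketch: the helper name read_result_stats and the sample HTML strings are hypothetical and simplified, not the actual Google Scholar markup.

```python
from bs4 import BeautifulSoup


def read_result_stats(html):
    """Return the text of the div with id="gs_ab_md", or None if it is absent.

    Hypothetical helper: it takes already-downloaded HTML, so the lookup
    can be checked without making a network request.
    """
    soup = BeautifulSoup(html, "html.parser")
    item = soup.find("div", {"id": "gs_ab_md"})
    if item is None:
        # soup.find returns None when nothing matches; calling .text on it
        # would raise AttributeError.
        return None
    return item.get_text().strip()


# Simplified sample documents (assumptions, not real Google Scholar pages).
sample_html = '<div id="gs_ab_md">Pagina 2 di circa 14.400 risultati (0,02 sec)</div>'
other_html = "<html><body>No statistics here</body></html>"

print(read_result_stats(sample_html))
print(read_result_stats(other_html))
```

With the original code, r.text would be passed in place of sample_html, and a None result signals that the page layout was not the expected one.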
Later you can use string functions to extract only 14.400.
For example:
parts = item.text.split(' ')
print(parts[4])
Result:
14.400
Solution 2:[2]
As an alternative to furas's answer, the same thing can be achieved with the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you get the right number of results even when the language changes or no page number is displayed. For example, item.text.split(' ')[4] from furas's answer stops working when the page number is not displayed and throws an IndexError:
text = "Pagina 2 di circa 14.400 risultati (0,02 sec)".split(" ")[4]
broken_text = "14.400 risultati (0,02 sec)".split(" ")[4]
print(text, broken_text, sep="\n")
'''
14.400
broken_text = "14.400 risultati (0,02 sec)".split(" ")[4]
IndexError: list index out of range
'''
# A regular expression should be used to avoid this behavior.
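One possible shape for such a regular expression is sketched below. It is not taken from either answer: the function name extract_total_results is hypothetical, and it assumes that the timing is the only part in parentheses, that the total is the last number before it, and that thousands are separated by "." or ",".

```python
import re


def extract_total_results(stats_text):
    """Return the total result count as an int, or None if no number is found.

    Assumptions (not guaranteed by Google Scholar): the timing such as
    "(0,02 sec)" is the only parenthesised part, and the total is the last
    number that appears outside the parentheses.
    """
    # Drop the parenthesised timing so "0,02" is not mistaken for a count.
    without_timing = re.sub(r"\([^)]*\)", "", stats_text)
    # Match integers with optional "." or "," thousands groups, e.g. 14.400.
    numbers = re.findall(r"\d+(?:[.,]\d{3})*", without_timing)
    if not numbers:
        return None
    # The page number, if present, comes first; the total comes last.
    return int(re.sub(r"[.,]", "", numbers[-1]))


print(extract_total_results("Pagina 2 di circa 14.400 risultati (0,02 sec)"))  # 14400
print(extract_total_results("14.400 risultati (0,02 sec)"))  # 14400
```

Unlike split(" ")[4], this does not depend on the position of the number in the string, so it handles both the string with the page number and the one without it.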
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY", # SerpApi API key
"engine": "google_scholar", # Google Scholar Organic results engine
"q": "electronic knee", # search query
"hl": "it", # language
"as_ylo": "2017" # from year
}
search = GoogleSearch(params)
results = search.get_dict()
print(results["search_information"]["total_results"]) # total number of results
# 15700
Disclaimer: I work for SerpApi.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | furas |
| Solution 2 | Dmitriy Zub |
