Why am I getting repetitive output while trying to scrape data from Google Scholar?
I am trying to scrape the PDF links from Google Scholar search results. I set a page counter based on the change in the URL, but after the first eight output links, the same links keep repeating in the output.
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests

#modifying the url as per page
urlCounter = 0
while urlCounter <= 30:
    urlPart1 = "http://scholar.google.com/scholar?start="
    urlPart2 = "&q=%22entity+resolution%22&hl=en&as_sdt=0,4"
    url = urlPart1 + str(urlCounter) + urlPart2
    page = urllib2.Request(url, None, {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"})
    resp = urllib2.urlopen(page)
    html = resp.read()
    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10

    recordCount = 0
    while recordCount <= 9:
        recordPart1 = "gs_ggsW"
        finRecord = recordPart1 + str(recordCount)
        recordCount = recordCount + 1
        #printing the links
        for link in soup.find_all('div', id=finRecord):
            linkstring = str(link)
            soup1 = BeautifulSoup(linkstring)
            for link in soup1.find_all('a'):
                print(link.get('href'))
Solution 1:[1]
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
Code and example in the online IDE to extract PDFs:
from bs4 import BeautifulSoup
import requests, lxml

params = {
    "q": "entity resolution",  # search query
    "hl": "en"                 # language
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for pdf_link in soup.select(".gs_or_ggsm a"):
    pdf_file_link = pdf_link["href"]
    print(pdf_file_link)
# output from the first page:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''
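The snippet above only fetches the first page. To walk the follow-up pages the way the original loop intended, you can increment the start parameter in steps of 10. A minimal sketch under that assumption, reusing the same endpoint and selector; the four-page range simply mirrors the question's loop, and Google Scholar may still throttle or CAPTCHA rapid requests:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

# walk result offsets 0, 10, 20, 30 -- the same range as the loop in the question
for offset in range(0, 40, 10):
    params = {
        "q": "entity resolution",  # search query
        "hl": "en",                # language
        "start": offset            # result offset used for pagination
    }
    html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for pdf_link in soup.select(".gs_or_ggsm a"):
        print(pdf_link["href"])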
Alternatively, you can achieve the same thing by using the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
The main difference is that you only need to grab the data from structured JSON instead of figuring out how to extract it from HTML or how to bypass blocks from the search engine.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",   # SerpApi API key
    "engine": "google_scholar",  # Google Scholar organic results
    "q": "entity resolution",    # search query
    "hl": "en"                   # language
}

search = GoogleSearch(params)
results = search.get_dict()

for pdfs in results["organic_results"]:
    for link in pdfs.get("resources", []):
        pdf_link = link["link"]
        print(pdf_link)
# output:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''
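If you also need results beyond the first page from SerpApi, the google_scholar engine accepts a start offset (0, 10, 20, ...) analogous to the Scholar URL itself. A short sketch under that assumption, with the same four-page range as the question:

from serpapi import GoogleSearch

# fetch result offsets 0, 10, 20, 30, matching the question's loop
for offset in range(0, 40, 10):
    params = {
        "api_key": "YOUR_API_KEY",   # SerpApi API key
        "engine": "google_scholar",  # Google Scholar organic results
        "q": "entity resolution",    # search query
        "hl": "en",                  # language
        "start": offset              # result offset (assumed to behave like Scholar's start parameter)
    }
    results = GoogleSearch(params).get_dict()

    for result in results.get("organic_results", []):
        for resource in result.get("resources", []):
            print(resource["link"])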
If you want to scrape more data from the organic results, there's a dedicated Scrape Google Scholar with Python blog post of mine.
Disclaimer: I work for SerpApi.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dmitriy Zub |
