How to programmatically detect if Google is blocking me from making any further requests?
Context:
- I have a web scraper running in a web app hosted on Heroku.
- It scrapes the Google search results page to get some required information.
- I am using the requests package.
This is my code:
import requests

# this is a method inside a class
def get_weather_component(self):
    # Send the request and store the webpage that comes back as the response
    s = requests.Session()
    s.headers["User-Agent"] = self.USER_AGENT
    s.headers["Accept-Language"] = self.LANGUAGE
    s.headers["Content-Language"] = self.LANGUAGE
    html = s.get(self.url)
Note: I know that I can check the status code to see whether it is error 429 (Too Many Requests).
Issues:
- But can there be any other possible reasons for the request being blocked that need to be handled? (See the sketch after the question.)
- What is the minimum time gap between requests that Google requires?
Any suggestions gratefully received. Thanks in advance.
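For illustration, a block is usually detectable from more than the 429 status code: Google may also answer with 503, or redirect the request to a CAPTCHA page under a /sorry/ path. The sketch below combines these checks; the /sorry/ marker and the helper name is_blocked are assumptions added for illustration, not a documented contract, and the attribute names match the question's code.

import requests

def is_blocked(response: requests.Response) -> bool:
    # 429 (Too Many Requests) and 503 are the usual rate-limiting responses.
    if response.status_code in (429, 503):
        return True
    # Blocked clients are often redirected to a CAPTCHA page under /sorry/
    # (commonly observed behaviour, not a documented guarantee).
    return "/sorry/" in response.url

# Hypothetical usage inside the question's method:
# html = s.get(self.url)
# if is_blocked(html):
#     ...  # back off, rotate IPs, or fall back to the official API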
Solution 1:[1]
There is an official API for Google Search. Google has most likely placed a limit on the number of requests coming from the same IP address, so you can:
- slow down until you figure out the limit (add a time.sleep() call or something like that; see the sketch after this list).
- run it on several servers so the requests appear to come from different IP addresses (for example, deploy your application in a containerized environment).
- stop trying to crawl Google search results directly and use their RESTful API (the Custom Search JSON API) instead.
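Google does not publish a minimum gap between requests, so the delay values below are assumptions; the sketch only illustrates the pattern of sleeping between attempts and backing off when a 429 arrives, honouring the Retry-After header when it is present.

import time
import requests

def fetch_with_backoff(session, url, base_delay=5.0, max_retries=5):
    """Fetch url, sleeping between attempts and backing off on 429 responses.

    base_delay and max_retries are illustrative values, not limits published by Google.
    """
    response = None
    for attempt in range(max_retries):
        response = session.get(url)
        if response.status_code != 429:
            break
        # Honour Retry-After when the server sends it in seconds,
        # otherwise fall back to exponential backoff.
        retry_after = response.headers.get("Retry-After", "")
        wait = int(retry_after) if retry_after.isdigit() else base_delay * (2 ** attempt)
        time.sleep(wait)
    return response

# Example: fetch_with_backoff(requests.Session(), "https://www.google.com/search?q=weather")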
Solution 2:[2]
import requests

# Accept-Language and Content-Language are HTTP headers, so they belong in the
# headers dict rather than in the request body; self.url and self.LANGUAGE come
# from the question's class.
headers = {
    "Accept-Language": self.LANGUAGE,
    "Content-Language": self.LANGUAGE,
}
r = requests.get(self.url, headers=headers)
print(r.status_code)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Varadharajan Raghavendran |
| Solution 2 | gre_gor |
