Extract the number of results from a Google search
I am writing a web scraper to extract the number of search results that Google shows at the top left of the results page. I have written the code below, but I do not understand why phrase_extract is None. I want to extract the phrase "About 12,010,000,000 results". Which part am I getting wrong? Maybe I am parsing the HTML incorrectly?
import requests
from bs4 import BeautifulSoup

def pyGoogleSearch(word):
    address = 'http://www.google.com/#q='
    newword = address + word
    #webbrowser.open(newword)
    page = requests.get(newword)
    soup = BeautifulSoup(page.content, 'html.parser')
    phrase_extract = soup.find(id="resultStats")
    print(phrase_extract)

pyGoogleSearch('world')
Solution 1:[1]
You're actually using the wrong URL to query Google's search engine. You should be using http://www.google.com/search?q=<query>.
So it'd look like this:
def pyGoogleSearch(word):
    address = 'http://www.google.com/search?q='
    newword = address + word
    page = requests.get(newword)
    soup = BeautifulSoup(page.content, 'html.parser')
    phrase_extract = soup.find(id="resultStats")
    print(phrase_extract)
You also probably just want the text of that element, not the element itself, so you can do something like:
phrase_text = phrase_extract.text
or to get the actual value as an integer:
val = int(phrase_extract.text.split(' ')[1].replace(',',''))
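Putting it all together, here's a minimal sketch of the corrected function that returns the count as an integer. It assumes Google still serves the resultStats element to plain requests clients (see Solution 2 for why it may not), and the helper name google_result_count is just illustrative:

import requests
from bs4 import BeautifulSoup

def google_result_count(word):
    # Query the real search endpoint, not the '#q=' fragment URL
    page = requests.get('http://www.google.com/search?q=' + word)
    soup = BeautifulSoup(page.content, 'html.parser')
    stats = soup.find(id="resultStats")
    if stats is None:
        # Google served a layout without the element (see Solution 2)
        return None
    # "About 12,010,000,000 results" -> 12010000000
    return int(stats.text.split(' ')[1].replace(',', ''))

print(google_result_count('world'))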
Solution 2:[2]
You could also try printing the output of the div above; sometimes that will show what you actually received.

Also, make sure you're sending a user-agent header, since Google could otherwise treat your script as a tablet user-agent (or something different) and serve markup with different .class and #id selectors. This could be the reason why your output is empty ([]).
Here's the code to get the number of search results:
from lxml import html
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=beautiful+cookies',
                        headers=headers,
                        stream=True)
response.raw.decode_content = True
tree = html.parse(response.raw)

# lxml is used to select the element by XPath
# Requests + lxml: https://stackoverflow.com/a/11466033/1291371
# Note: you can achieve the same with bs4 by grabbing the "#result-stats" id selector.
result = tree.xpath('//*[@id="result-stats"]/text()')[0]
print(result)
# About 3,890,000,000 results
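As the comment above notes, the same element can be grabbed with bs4 instead of lxml. A minimal sketch, under the same assumption that the stats live in an element with the result-stats id:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=beautiful+cookies',
                        headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# Same element the XPath above targets, selected by its CSS id
stats = soup.select_one('#result-stats')
print(stats.text if stats else None)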
Alternatively, you can use the Google Search Engine Results API from SerpApi to achieve the same thing in an easier fashion.
Part of the JSON output:

"search_information": {
  "organic_results_state": "Results for exact spelling",
  "total_results": 3890000000,
  "time_taken_displayed": 0.65,
  "query_displayed": "beautiful cookies"
}
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "beautiful cookies",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)
# 4210000000
Disclaimer: I work for SerpApi.
Solution 3:[3]
If you don't mind using just the command line, try filtering with htmlq:
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
term="something"
curl --silent -A "$user_agent" "https://www.google.com/search?hl=en&q=$term" | htmlq "#result-stats" | grep -o "About.*results" | grep -o '[0-9]' | tr -d "\n"
If you get a 403 error, try other user agents.
There are probably better ways (with awk or sed) instead of grep and tr.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | Pablo Bianchi |