Scrape Google search snippet results
I'm trying to write a small program that takes a search query as input, opens your browser with the results, and then scrapes the Google search result and prints it. I don't know how to go about the scraping part. This is all I have so far:
import webbrowser
from urllib.parse import quote_plus
query = input("What would you like to search: ")
# URL-encode the query (spaces become "+") rather than appending a "+" per character
webbrowser.open("https://www.google.com/search?q=" + quote_plus(query))
Let's say the user types "Who is donald trump?". Their browser opens and shows the Google results page for that query.
How would I go about scraping the summary provided by Wikipedia and printing it back to the user? Or, more generally, how do I scrape any data from a website?
Solution 1:[1]
To scrape just the summary, you can use the select_one() method provided by bs4 with a CSS selector. The SelectorGadget Chrome extension, or any similar tool, makes it quick to find a selector.
Make sure you send a user-agent header; otherwise Google could block your request, because the default user-agent will be python-requests (if you are using the requests library).
Use a real browser's user-agent string to fake a user visit.
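To make the user-agent point concrete, here is a minimal sketch that prints the header requests would send by default and then overrides it on a Session. It makes no network request. The name BROWSER_AGENT is just an illustrative choice, and the string is the one used in the full example in this answer.

```python
import requests

# Header value requests sends when none is supplied: "python-requests/<version>".
default_agent = requests.utils.default_user_agent()

# Browser-like value (assumption: any current browser string would do as well).
BROWSER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)

# A Session sends the same headers with every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": BROWSER_AGENT})

print("default:", default_agent)
print("session:", session.headers["User-Agent"])
```

Setting the header once on a Session avoids passing headers= to every call.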
You can scrape any other part of the page the same way with the select_one() method. Keep in mind that you can scrape info from the Knowledge Graph only if Google provides it for that query, so use an if or try-except statement to handle the case where it is missing.
Code and full example:
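from bs4 import BeautifulSoup
import requests
# A browser-like User-Agent; the default python-requests one may be blocked.
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
# Pass the query via params so requests URL-encodes it.
html = requests.get('https://www.google.com/search', params={'q': 'who is donald trump'}, headers=headers).text
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
# select_one() returns None when nothing matches, in which case .text raises AttributeError.
summary = soup.select_one('.Uo8X3b+ span').text
print(summary)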
Output:
Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.
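The earlier advice about guarding with an if statement can be sketched as a small helper. This is only an illustration, not part of the original answer: extract_summary and SAMPLE_HTML are hypothetical names, the sample markup is a made-up stand-in for a fetched page (real Google markup differs and changes often), and the built-in html.parser is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a fetched results page.
SAMPLE_HTML = """
<div>
  <h3 class="Uo8X3b">Description</h3>
  <span>Sample summary text.</span>
</div>
"""


def extract_summary(html, selectors=(".Uo8X3b+ span",)):
    """Return the text of the first selector that matches, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        node = soup.select_one(selector)
        if node is not None:  # select_one() returns None on no match
            return node.get_text(strip=True)
    return None


print(extract_summary(SAMPLE_HTML))
print(extract_summary("<p>no knowledge graph here</p>"))
```

Passing several selectors lets you add a fallback when Google changes its class names, without touching the calling code.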
An alternative is to use the Google Knowledge Graph API from SerpApi. It's a paid API with a free plan. Check out the playground to see if it suits your needs.
Example code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "who is donald trump",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
summary = results["knowledge_graph"]['description']
print(summary)
Output:
Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.
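Since a knowledge graph is not returned for every query, the lookup can be guarded in the same spirit as the scraping example. The sketch below is an assumption-laden illustration: get_description is a hypothetical helper, and the two dictionaries are invented stand-ins for a response, trimmed to the keys used in the example.

```python
def get_description(results):
    """Return results["knowledge_graph"]["description"], or None if either key is absent."""
    return results.get("knowledge_graph", {}).get("description")


# Hypothetical response dictionaries, not real API output.
with_graph = {"knowledge_graph": {"description": "Sample description."}}
without_graph = {"organic_results": []}

print(get_description(with_graph))
print(get_description(without_graph))
```

Chaining .get() calls avoids the KeyError that direct indexing would raise when the key is missing.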
Disclaimer: I work for SerpApi.
Solution 2:[2]
I used the Selenium WebDriver and extracted the Google result snippets successfully.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
x = "who is donald trump"  # x is the query (example value)
# Specify the path of the Chrome driver; newer Selenium releases may require a Service object instead.
browser = webdriver.Chrome(r'path\chromedriver')
browser.get('http://google.co.in/')
sbar = browser.find_element(By.ID, 'lst-ib')  # id of the search box when this was written
sbar.send_keys(x)
sbar.send_keys(Keys.ENTER)
# Elements on Google's results page use different classes and ids,
# so try several selectors and keep the last one that matches.
elem = None
for selector in ('div.MUxGbd.t51gnb.lyLwlc.lEBKkf', 'span.ILfuVd.yZ8quc', 'div.Z0LcW'):
    try:
        elem = browser.find_element(By.CSS_SELECTOR, selector)
    except NoSuchElementException:
        pass
if elem is not None:
    print(elem.text)
I hope it helps. If you find errors, please let me know! P.S. Take care of the indentation.
Note: you need the driver for the browser you will be using.
Solution 3:[3]
The code above works well except for the ID: with id="rhs_block" I don't get any results. I used id="res" instead. Maybe the page was updated recently.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Naazneen Jatu |
| Solution 3 | Mayuri K |
