Python Webscrape Google Custom Search URL with Parameters

I am working on a project that searches for similar images using Google's search-by-image upload and Google's Custom Search API. From that, I get the correct URL for the page that lists similar images, and I simply want the HTML of that page. The page looks like this LINK. To fetch it, I tried this:

r = requests.get(fetchUrl)

print(r.text)

This prints the HTML of what looks like a really old Google main page, not the results page. I am not sure where it is coming from. I also tried adding a User-Agent header so that Google doesn't block me from scraping.
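A header dictionary only has an effect when it is passed to the request call itself. The following is a minimal sketch, assuming the requests library used in this question, that builds two requests without sending them and prints the User-Agent each would carry. The URL and the helper name `prepare_get` are placeholders introduced for illustration.

```python
import requests

# User-Agent string copied from the question; any browser-like value would do.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    )
}


def prepare_get(url, headers=None):
    """Build a GET request the way a requests session would, without sending it."""
    with requests.Session() as session:
        return session.prepare_request(requests.Request("GET", url, headers=headers))


# Placeholder URL (an assumption); in the question this would be fetchUrl.
example_url = "https://www.google.com/search?q=example"

with_headers = prepare_get(example_url, headers=headers)
default_only = prepare_get(example_url)

# The first request carries the browser-like string; the second falls back to
# the library default, which identifies the client as python-requests.
print(with_headers.headers["User-Agent"])
print(default_only.headers["User-Agent"])
```

In the question's own terms, this corresponds to calling `requests.get(fetchUrl, headers=headers)` rather than `requests.get(fetchUrl)`. Whether that changes what Google returns is not something this sketch can show.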

Entire code:

import requests

filePath = 'Initial_Img/a/frame1.jpg'
searchUrl = 'http://www.google.com/searchbyimage/upload'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# Upload the image; the response is a redirect whose Location is the results page.
with open(filePath, 'rb') as image:
    multipart = {'encoded_image': (filePath, image), 'image_content': ''}
    # Pass the headers explicitly, otherwise they are never sent.
    response = requests.post(searchUrl, files=multipart, headers=headers, allow_redirects=False)
fetchUrl = response.headers['Location']

print(fetchUrl)
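Since the redirect URL carries the search as query parameters, it can help to look at those parameters before fetching the page. Below is a small sketch using only the standard library. The sample URL is made up for illustration and the real `Location` value will differ; `describe_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs


def describe_url(url):
    """Split a URL into host, path and a dict of query parameters."""
    parts = urlsplit(url)
    return parts.netloc, parts.path, parse_qs(parts.query)


# Made-up sample standing in for fetchUrl (an assumption, not a real token).
sample_url = "https://www.google.com/search?tbs=sbi:EXAMPLE_TOKEN&hl=en"

host, path, params = describe_url(sample_url)
print(host)
print(path)
for name, values in params.items():
    print(name, values)
```

Printing the parts this way makes it easy to see whether the URL handed to `requests.get` is the same one the browser ends up on.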

Do you have any ideas? Any help is truly appreciated.



Solution 1:[1]

The problem seems to be related to the way Google renders the page. You would have to use Selenium to drive a real web browser and read the HTML from it. To solve your problem:

Run: sudo apt install firefox-geckodriver and install Firefox

Run: pip install selenium

Change your code to this:

import requests
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

filePath = 'Initial_Img/a/test.jpg'
searchUrl = 'http://www.google.com/searchbyimage/upload'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# Upload the image; the response is a redirect whose Location is the results page.
with open(filePath, 'rb') as image:
    multipart = {'encoded_image': (filePath, image), 'image_content': ''}
    response = requests.post(searchUrl, files=multipart, headers=headers, allow_redirects=False)
fetchUrl = response.headers['Location']

# Run Firefox without a visible window.
options = Options()
options.add_argument("--headless")
nav = webdriver.Firefox(options=options)
try:
    nav.get(fetchUrl)
    print(nav.page_source)
finally:
    nav.quit()  # always close the browser, even if loading the page fails

nav.page_source gives you the HTML of the final page. I hope this helps. I don't know why a plain requests.get doesn't work. If anyone knows the reason, please comment below.
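The answer leaves the cause open. One way to investigate, offered as a diagnostic sketch and not as an explanation, is to summarise the HTML returned by requests and the HTML returned by Selenium and compare the two. The helper `summarize_html` and the two sample strings below are hypothetical; in practice you would pass in `r.text` and `nav.page_source`.

```python
from html.parser import HTMLParser


class _PageStats(HTMLParser):
    """Collect the title, the number of script tags and the amount of visible text."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.script_count = 0
        self.text_chars = 0
        self._current = None  # tag whose content is handled specially

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style", "title"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title_parts.append(data)
        elif self._current is None:
            # Only count text outside script, style and title elements.
            self.text_chars += len(data.strip())


def summarize_html(html):
    """Return a small dict describing an HTML document."""
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    return {
        "title": "".join(stats.title_parts).strip(),
        "scripts": stats.script_count,
        "text_chars": stats.text_chars,
    }


# Made-up stand-ins for r.text and nav.page_source.
from_requests = "<html><head><title>Google</title></head><body><p>Old page</p></body></html>"
from_browser = (
    "<html><head><title>Results</title><script>var a = 1;</script></head>"
    "<body><p>Similar images</p></body></html>"
)

print(summarize_html(from_requests))
print(summarize_html(from_browser))
```

If the two summaries differ mainly in title and visible text, the server probably sent different pages to the two clients. If they differ mainly in script count and rendered text, the content is more likely being built in the browser.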

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    DragonflyRobotics