'How to retrieve yahoo search results?

I am trying to search Yahoo for a query using this code:

import requests
from bs4 import BeautifulSoup

query = "deep"
yahoo = "https://search.yahoo.com/search?q=" + query + "&n=" + str(10)
raw_page = requests.get(yahoo)

soup = BeautifulSoup(raw_page.text)

for link in soup.find_all(attrs={"class": "ac-algo fz-l ac-21th lh-24"}):
    print (link.text, link.get('href'))

But this does not work and the result is empty. How can I get 10 first search results?



Solution 1:[1]

Here are the main problems with your code:

  • When using Beautiful soup you should always include a parser (e.g. BeautifulSoup(raw_page.text, "lxml"))

  • You were searching for the wrong class, it's " ac-algo fz-l ac-21th lh-24" not "ac-algo fz-l ac-21th lh-24" (notice the space in the begining)

All in all your code should look like this:

import requests
from bs4 import BeautifulSoup

query = "deep"
yahoo = "https://search.yahoo.com/search?q=" + query + "&n=" + str(10)
raw_page = requests.get(yahoo)

soup = BeautifulSoup(raw_page.text, "lxml")
for link in soup.find_all(attrs={"class": " ac-algo fz-l ac-21th lh-24"}):
    print(link.text, link.get('href'))

Hope this helps

Solution 2:[2]

You can use Css Selector to find all links which is must faster.

import requests
from bs4 import BeautifulSoup

query = "deep"
yahoo = "https://search.yahoo.com/search?q=" + query + "&n=" + str(10)
raw_page = requests.get(yahoo)

soup = BeautifulSoup(raw_page.text,'lxml')

for link in soup.select(".ac-algo.fz-l.ac-21th.lh-24"):
    print (link.text, link['href'])

Output :

(Deep | Definition of Deep by Merriam-Webster', 'https://www.merriam-webster.com/dictionary/deep')
(Connecticut Department of Energy & Environmental Protection', 'https://www.ct.gov/deep/site/default.asp')
(Deep | Define Deep at Dictionary.com', 'https://www.dictionary.com/browse/deep')
(Deep - definition of deep by The Free Dictionary', 'https://www.thefreedictionary.com/deep')
(Deep (2017) - IMDb', 'https://www.imdb.com/title/tt4105584/')
(Deep Synonyms, Deep Antonyms | Merriam-Webster Thesaurus', 'https://www.merriam-webster.com/thesaurus/deep')
(Deep Synonyms, Deep Antonyms | Thesaurus.com', 'https://www.thesaurus.com/browse/deep')
(DEEP: Fishing - Connecticut', 'https://www.ct.gov/deep/cwp/view.asp?q=322708')
(Deep Deep Deep - YouTube', 'https://www.youtube.com/watch?v=oZhwagxWzOc')
(deep - English-Spanish Dictionary - WordReference.com', 'https://www.wordreference.com/es/translation.asp?tranword=deep')

Solution 3:[3]

If you try to print the text of the current selector, you will get the combined text of the child <span> element as well, which is not what you want. To do that, you can use the h3 > a selector.

An illustration of what to get

However, when using BeautifulSoup, CSS selector's behavior could be weird, for example:

# bs4

for result in soup.select(".compTitle.options-toggle"):
    text = result.select_one("h3 > a").text
    print(text)
 
#                                                     ?      ?      ?                          
# www.merriam-webster.com › dictionary › deepDeep Definition & Meaning - Merriam-Webster
# parsel

for result in soup.css(".compTitle.options-toggle"):
    text = result.css("h3 > a::text").get()
    print(text)

# Deep Definition & Meaning - Merriam-Webster

This could be because parsel translates every CSS selector query to XPath but I'm not sure what is causing such behavior in BeautifulSoup.

Also, make sure you're using request headers user-agent to act as a "real" user visit. Because default requests user-agent is python-requests and websites understand that it's most likely a script that sends a request. Check what's your user-agent.

Code and full example in online IDE:

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "p": "deep"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36",
}

html = requests.get("https://search.yahoo.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for result in soup.select(".compTitle.options-toggle"):
    url = result.select_one("a")["href"]
    text = list(result.select_one("a").text)

    for i in range(len(result.select_one("span").text)):
        text.pop(0)
        
    print("".join(text), url, sep="\n")

Output:

DeepL Translate: The world's most accurate translator
https://r.search.yahoo.com/_ylt=Awr9GjGgZERiFfEANhtXNyoA;_ylu=Y29sbwNncTEEcG9zAzEEdnRpZAMEc2VjA3Ny/RV=2/RE=1648678176/RO=10/RU=https%3a%2f%2fwww.deepl.com%2ftranslator/RK=2/RS=TVz6fqq87B12oa7dLig44PkzAJs-
... other results

Using parsel:

from parsel import Selector
import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "p": "deep"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36",
}

html = requests.get("https://search.yahoo.com/search", params=params, headers=headers, timeout=30)
soup = Selector(text=html.text)

for result in soup.css(".compTitle.options-toggle"):
    url = result.css("a::attr(href)").get()
    text = result.css("h3 > a::text").get()
    print(text, url, sep="\n")

Output:

DeepL Translate: The world's most accurate translator
https://r.search.yahoo.com/_ylt=Awr9GjGgZERiFfEANhtXNyoA;_ylu=Y29sbwNncTEEcG9zAzEEdnRpZAMEc2VjA3Ny/RV=2/RE=1648678176/RO=10/RU=https%3a%2f%2fwww.deepl.com%2ftranslator/RK=2/RS=TVz6fqq87B12oa7dLig44PkzAJs-
... other results

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 KunduK
Solution 3