How to extract all the hrefs and src inside specific divs with beautifulsoup python
I want to extract all the href and src attributes inside all the divs on the page that have class='news_item'.
The html looks like this:
<div class="col">
  <div class="group">
    <h4>News</h4>
    <div class="news_item">
      <a href="www.link.com">
        <h2 class="link">
          here is a link-heading
        </h2>
        <div class="Img">
          <img border="0" src="/image/link" />
        </div>
        <p></p>
      </a>
    </div>
from here what I want to extract is:
www.link.com, here is a link-heading, and /image/link
My code is:
def scrape_a(url):
    news_links = soup.select("div.news_item [href]")
    for links in news_links:
        if news_links:
            return 'http://www.web.com' + news_links['href']

def scrape_headings(url):
    for news_headings in soup.select("h2.link"):
        return str(news_headings.string.strip())

def scrape_images(url):
    images = soup.select("div.Img[src]")
    for image in images:
        if images:
            return 'http://www.web.com' + news_links['src']

def top_stories():
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    link = scrape_a(soup)
    heading = scrape_headings(soup)
    image = scrape_images(soup)
    message = {'heading': heading, 'link': link, 'image': image}
    print message
The problem is that it gives me error:
**TypeError: 'NoneType' object is not callable**
Here is the Traceback:
Traceback (most recent call last):
  File "web_parser.py", line 40, in <module>
    top_stories()
  File "web_parser.py", line 32, in top_stories
    link = scrape_a('www.link.com')
  File "web_parser.py", line 10, in scrape_a
    news_links = soup.select_all("div.news_item [href]")
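For context, a minimal sketch of why this exact TypeError appears (assuming a classic bs4 version): BeautifulSoup resolves unknown attribute names as child-tag lookups via find(), so a typo like select_all (the real methods are select and find_all) evaluates to None, and calling None raises the error.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='news_item'></div>", "html.parser")

# `select_all` is not a real method; classic bs4 treats the attribute
# access as a tag lookup and returns None, so calling it raises
# TypeError: 'NoneType' object is not callable. Newer bs4 releases may
# raise AttributeError instead, so both are caught here.
try:
    soup.select_all("div.news_item [href]")
except (TypeError, AttributeError) as e:
    print(e)  # e.g. 'NoneType' object is not callable
```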
Solution 1:[1]
Most of your errors come from the fact that news_links is not what you expect: you aren't getting back the tag you think you are.
Change:
news_links = soup.select("div.news_item [href]")
for links in news_links:
    if news_links:
        return 'http://www.web.com' + news_links['href']
to this and see if it helps:
news_links = soup.find_all("div", class_="news_item")
for link in news_links:
    anchor = link.find("a")
    if anchor:
        return 'http://www.web.com' + anchor.get('href')
Also note that the return statement will give you something like http://www.web.comwww.link.com, which I don't think you want.
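To illustrate the concatenation problem, here is a small sketch using urljoin instead of plain string concatenation (the base URL is a placeholder from the question; the import shown is the Python 3 location):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = 'http://www.web.com'

# Plain concatenation glues the strings together with no separator:
print(base + 'www.link.com')       # http://www.web.comwww.link.com

# urljoin resolves paths against the base correctly:
print(urljoin(base, '/image/link'))            # http://www.web.com/image/link
# ...and leaves already-absolute URLs untouched:
print(urljoin(base, 'http://www.link.com/a'))  # http://www.link.com/a
```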
Solution 2:[2]
Your idea to split the tasks into different methods is pretty good -
nice to read, to change and to reuse.
The errors are almost solved. The traceback shows a call to select_all, but that method does not exist in BeautifulSoup (and it isn't in the code you posted either), plus a few other issues. Long story short, I would do it like this:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 3: from urllib.parse import urljoin
import requests

def news_links(url, soup):
    links = []
    for item in soup.select("div.news_item"):
        for x in item.find_all(href=True):
            links.append(urljoin(url, x['href']))
    return links

def news_headings(soup):
    headings = []
    for heading in soup.select("h2.link"):
        headings.append(str(heading.string.strip()))
    return headings

def news_images(url, soup):
    sources = []
    for image in soup.select("img[src]"):
        sources.append(urljoin(url, image['src']))
    return sources

def top_stories():
    url = 'http://www.web.com/'
    r = requests.get(url)
    content = r.content
    soup = BeautifulSoup(content)
    message = {'heading': news_headings(soup),
               'link': news_links(url, soup),
               'image': news_images(url, soup)}
    return message

print top_stories()
BeautifulSoup is robust: if you find or select something that is not there, it returns an empty list instead of raising an exception. It looks like you are parsing a list of items, and the code above is pretty close to being reusable for that.
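A quick sketch of that behavior on a snippet with none of the wanted elements (the markup here is a made-up fragment, not from the question):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='col'><h4>News</h4></div>", "html.parser")

# select() and find_all() return empty lists for missing elements,
# so loops over them simply do nothing; no exception is raised.
print(soup.select("div.news_item"))   # []
print(soup.find_all("a", href=True))  # []

# find(), by contrast, returns None when nothing matches.
print(soup.find("img"))               # None
```

This is why the looping versions above degrade gracefully on pages where a section is missing, whereas chaining calls onto a find() result can blow up with AttributeError.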
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | dstudeba |
| Solution 2 | |
