There is a problem in Python web-scraping [closed]
I wrote some code to scrape data from a site, but there is a problem. The site is a news portal.
articleIndex = 0
for div in mainPage_soup.findAll('div', attrs={'class': 'title'}):
    if articleIndex < 2:
        article = requests.get(article_url)
        article_soup = BeautifulSoup(article.content, "html.parser")
        d = ""
        date_soup = BeautifulSoup(html)
        d = date_soup.find('time', class_='article-datetime').get_text()
        print(d)
        article_content_str = ""
        text = article_soup.find('div', class_='article-content entry-content')
        for item in text.find_all('p'):
            text = "#" + item.text
            article_content_str += text
The site name: hvg.hu
I get a NoneType error with the date and with the p tags.
The date is the article release date, and the p tags hold the article text sentence by sentence.
I tried a lot with the date: plain text, get_text(), but nothing works.
The same approach works with different sites (if I write out their class names), so I don't know where the problem is.
Maybe I chose the wrong divs?
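For context, the NoneType error described above is what BeautifulSoup produces when find() matches nothing: it returns None, and calling .get_text() on None raises AttributeError. A minimal demonstration (the HTML snippet here is illustrative, not taken from hvg.hu):

```python
from bs4 import BeautifulSoup

html = "<div class='title'><p>Hello</p></div>"
soup = BeautifulSoup(html, "html.parser")

# find() returns None when no element matches, so chaining
# .get_text() onto the result raises AttributeError.
missing = soup.find("time", class_="article-datetime")
print(missing)  # None

# Guard before calling methods on the result:
date = missing.get_text(strip=True) if missing else "no date found"
print(date)  # no date found
```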
Solution 1:[1]
There is no one-size-fits-all solution, so you have to decide case by case.
Check which links you want to scrape and how they differ:

mainPage_soup.select('h1 a[title][href]')

Check on each article page whether the elements you expect are present (the walrus operator needs Python 3.8 or later):

if (t := article_soup.find('time', class_='article-datetime')):
    time = t.get_text(strip=True)
elif (t := article_soup.select_one('label:-soup-contains("megjelent") + p')):
    time = t.get_text(strip=True)
else:
    time = None

Otherwise, use a regular if/else statement:

if article_soup.find('time', class_='article-datetime'):
    time = article_soup.find('time', class_='article-datetime').get_text(strip=True)
elif article_soup.select_one('label:-soup-contains("megjelent") + p'):
    time = article_soup.select_one('label:-soup-contains("megjelent") + p').get_text(strip=True)
else:
    time = None
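If the same try-several-selectors pattern repeats for many fields, it can be wrapped in a small helper. This is a sketch, not part of the original answer; first_text is a hypothetical name, and it assumes only standard BeautifulSoup CSS selectors:

```python
from bs4 import BeautifulSoup

def first_text(soup, selectors, default=None):
    """Return the stripped text of the first matching selector, else default."""
    for sel in selectors:
        if (node := soup.select_one(sel)):
            return node.get_text(strip=True)
    return default

# The first selector matches nothing, so the helper falls through to the second.
soup = BeautifulSoup("<p class='b'>fallback</p>", "html.parser")
print(first_text(soup, ["time.article-datetime", "p.b"]))  # fallback
```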
Example
Sliced to the first 20 results; remove [:20] from the for loop if you want to scrape more, but be gentle and add some delay between your iterations:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://hvg.hu'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'
}
r = requests.get(url, headers=headers)
mainPage_soup = BeautifulSoup(r.content, 'html.parser')
data = []

for a in mainPage_soup.select('h1 a[title][href]')[:20]:
    # Resolve the article URL once, so relative hrefs get the base prepended
    # and absolute hrefs are stored unchanged.
    article_url = a.get('href') if 'http' in a.get('href') else url + a.get('href')
    article = requests.get(article_url)
    article_soup = BeautifulSoup(article.content, 'html.parser')

    if (t := article_soup.find('time', class_='article-datetime')):
        time = t.get_text(strip=True)
    elif (t := article_soup.select_one('label:-soup-contains("megjelent") + p')):
        time = t.get_text(strip=True)
    else:
        time = None

    if (t := article_soup.select('.article p')):
        text = t
    elif (t := article_soup.select('.article-content p')):
        text = t
    else:
        text = []

    data.append({
        'time': time,
        'text': ' '.join(p.get_text(strip=True) for p in text),
        'url': article_url
    })

print(data)
# or pd.DataFrame(data).to_csv('yourFile.csv', index=False)
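As a possible simplification of the href handling in the loop above, the standard library's urljoin resolves both absolute and relative hrefs, so the manual 'http' check is not needed. A short sketch:

```python
from urllib.parse import urljoin

base = 'https://hvg.hu'

# An absolute href is returned unchanged:
print(urljoin(base, 'https://hvg.hu/itthon/cikk'))  # https://hvg.hu/itthon/cikk

# A relative href is resolved against the base URL:
print(urljoin(base, '/itthon/cikk'))  # https://hvg.hu/itthon/cikk
```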
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
