There is a problem in python web-scraping [closed]

I wrote some code to scrape data from a site, but there is a problem. The site is a news portal.

        articleIndex = 0
        for div in mainPage_soup.findAll('div', attrs={'class': 'title'}):
            if articleIndex < 2:
                article = requests.get(article_url)
                article_soup = BeautifulSoup(article.content, "html.parser")

                d = article_soup.find('time', class_='article-datetime').get_text()
                print(d)

                article_content_str = ""
                content = article_soup.find('div', class_='article-content entry-content')
                for item in content.find_all('p'):
                    article_content_str += "#" + item.text

The site is hvg.hu.
I get a NoneType error for the date and for the p tags. The date is the article's release date, and the p tags hold the article text, sentence by sentence.

I tried a lot of things for the date (.text, get_text(), etc.), but nothing worked.

It works with different sites (if I write out their class names).

I don't know where the problem is.
Maybe I chose the wrong divs?



Solution 1:[1]

There is no one-size-fits-all solution, so you have to decide case by case.

  • Check which links you want to scrape and how they differ:

    mainPage_soup.select('h1 a[title][href]')
    
  • Check on each article page whether the elements you expect are present (the walrus operator requires Python 3.8 or later):

    if (t := article_soup.find('time', class_='article-datetime')):
        time = t.get_text(strip=True)
    elif (t := article_soup.select_one('label:-soup-contains("megjelent") + p')):
        time = t.get_text(strip=True)
    else:
        time = None
    

    Otherwise, use a regular if/else statement:

    if article_soup.find('time', class_='article-datetime'):
        time = article_soup.find('time', class_='article-datetime').get_text(strip=True)
    elif article_soup.select_one('label:-soup-contains("megjelent") + p'):
        time = article_soup.select_one('label:-soup-contains("megjelent") + p').get_text(strip=True)
    else:
        time = None
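The guarded lookup can be tried on a tiny standalone snippet; the HTML below and the extract_time helper are invented for illustration, not taken from hvg.hu:

```python
from bs4 import BeautifulSoup

# Two made-up snippets mimicking layouts an article page might use;
# the markup here is illustrative only.
html_a = '<article><time class="article-datetime">2022.03.01</time></article>'
html_b = '<article><label>megjelent</label><p>2022.03.02</p></article>'

def extract_time(soup):
    # find()/select_one() return None when nothing matches -- exactly
    # what caused the NoneType error in the question -- so guard each one.
    if (t := soup.find('time', class_='article-datetime')):
        return t.get_text(strip=True)
    if (t := soup.select_one('label:-soup-contains("megjelent") + p')):
        return t.get_text(strip=True)
    return None

print(extract_time(BeautifulSoup(html_a, 'html.parser')))  # 2022.03.01
print(extract_time(BeautifulSoup(html_b, 'html.parser')))  # 2022.03.02
```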
    
Example

The example is sliced to the first 20 results; drop the [:20] from the for loop if you want to scrape more, but be gentle and add some delay between your iterations:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://hvg.hu'

headers = ({'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept-Language': 'en-US, en;q=0.5'})

r = requests.get(url, headers=headers)
mainPage_soup = BeautifulSoup(r.content, 'html.parser')
data = []

for a in mainPage_soup.select('h1 a[title][href]')[:20]:
    if a.get('href').startswith('http'):
        article = requests.get(a.get('href'), headers=headers)
    else:
        article = requests.get(url + a.get('href'), headers=headers)
    
    article_soup = BeautifulSoup(article.content, 'html.parser')
    
    if (t := article_soup.find('time', class_='article-datetime')):
        time = t.get_text(strip=True)
    elif (t := article_soup.select_one('label:-soup-contains("megjelent") + p')):
        time = t.get_text(strip=True)
    else:
        time = None

    if (t := article_soup.select('.article p')):
        text = t
    elif (t := article_soup.select('.article-content p')):
        text = t
    else:
        text = []

    data.append({
        'time': time,
        'text': ' '.join([p.get_text(strip=True) for p in text]),
        'url': article.url
    })

print(data)
#or pd.DataFrame(data).to_csv('yourFile.csv', index=False)
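
As noted above, be gentle with the server. A minimal sketch of a per-iteration delay; the helper name and the 1-3 second range are arbitrary choices, not part of the original answer:

```python
import random
import time

def polite_delay(min_s=1.0, max_s=3.0):
    """Sleep for a random interval between requests.

    The 1-3 second default range is an arbitrary choice; tune it to
    whatever the target site can comfortably handle.
    """
    pause = random.uniform(min_s, max_s)
    time.sleep(pause)
    return pause

# Inside the scraping loop, call it once per article:
# for a in mainPage_soup.select('h1 a[title][href]')[:20]:
#     article = requests.get(...)
#     polite_delay()
```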

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
