href inside "Load more" button doesn't bring more articles when pasted as a URL

I'm trying to scrape this site:

https://noticias.caracoltv.com/colombia

At the end you can find a "Cargar Más" ("Load More") button that brings more news. So far so good. But when I inspect that element, it shows that it loads a link like this: https://noticias.caracoltv.com/colombia?00000172-8578-d277-a9f3-f77bc3df0000-page=2, as seen here:

[screenshot: the inspected "Cargar Más" element showing the href above]

The thing is, if I enter this URL into my browser, I get the same news I get when I open the original page. Because of this, the only way I can see to scrape the site is to write a script that recursively clicks the button. Since I need news going back to 2019, that doesn't seem very feasible.

Also, when checking the event listeners I see this:

[screenshot: the element's event listeners in the browser dev tools]

But I'm not sure how I can use that to my advantage.

Am I missing something? Is there any way to access older news through a link? (An API would be even better, but I didn't find any calls to one.)

I'm currently using Python to scrape, but I'm still in the investigation stage, so I have no meaningful code to show. Thanks a lot!



Solution 1:[1]

Unfortunately, there doesn't seem to be a nice API or any other way to get the data in bulk. You will need to iterate through the "pages" via the "load more" parameter and then parse the HTML.

It will take a little time to run, but this code will do that for you.

import requests
from bs4 import BeautifulSoup
import re
import json
from dateutil import parser
import datetime
import pandas as pd

# Browser-like User-Agent so the site serves the normal page
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

rows = []
endYear = 2019      # stop once articles from this year are reached
continueLoop = True
page = 1
while continueLoop:

    url = f'https://noticias.caracoltv.com/colombia?00000172-8578-d277-a9f3-f77bc3df0000-page={page}'
    response = requests.get(url, headers=headers)

    soup = BeautifulSoup(response.text, 'html.parser')
    
    if page == 1:
        # The main article and the top list only appear on the first page.
        # Each article card embeds its metadata as JSON-LD in a <script> tag.
        mainArticle = soup.find('div', {'class':'ListU'})
        jsonStr = str(mainArticle.find('script'))
        jsonStr = re.search('({.*})', jsonStr).group(1)
        jsonData = json.loads(jsonStr)

        articlePublished = jsonData['datePublished']
        dt = parser.parse(articlePublished)

        print(f"{dt.day}-{dt.month}-{dt.year} : {jsonData['headline']}")
        rows.append(jsonData)

        subArticles = soup.find_all('li', {'class':'ListE-items-item'})
        for subArticle in subArticles:
            jsonStr = str(subArticle.find('script'))
            jsonStr = re.search('({.*})', jsonStr).group(1)
            jsonData = json.loads(jsonStr)

            articlePublished = jsonData['datePublished']
            dt = parser.parse(articlePublished)

            print(f"{dt.day}-{dt.month}-{dt.year} : {jsonData['headline']}")
            rows.append(jsonData)

    # Get the bottom articles; this is the list the "Cargar Más" button grows
    loadMore = soup.find('ps-list-loadmore', {'class':'ListD'})
    articles = loadMore.find_all('li', {'class':'ListD-items-item'})
    for article in articles:
        if continueLoop:
            try:
                jsonStr = str(article.find('script'))
                jsonStr = re.search('({.*})', jsonStr).group(1)
                jsonData = json.loads(jsonStr)

                articlePublished = jsonData['datePublished']
                dt = parser.parse(articlePublished)
            except (AttributeError, KeyError, json.JSONDecodeError):
                # Some cards carry no JSON-LD <script>; fall back to the markup
                headline = article.find('a', href=True)['title']
                articlePublished = article.find('div', {'class':'PromoB-timestamp'})['data-timestamp']
                url = article.find('a', href=True)['href']

                # data-timestamp appears to be epoch milliseconds; parse it so
                # the year check below doesn't reuse the previous article's dt
                dt = datetime.datetime.fromtimestamp(int(articlePublished) / 1000)

                jsonData = {
                    'headline':headline,
                    'datePublished':articlePublished,
                    'url':url}

            if endYear == dt.year:
                print('Done')
                continueLoop = False

            else:
                print(f"{dt.day}-{dt.month}-{dt.year} : {jsonData['headline']}")
                rows.append(jsonData)

    page += 1


df = pd.DataFrame(rows)

Output:

print(df)
               @context  ...                                              video
0     http://schema.org  ...  {'@context': 'http://schema.org', '@type': 'Vi...
1     http://schema.org  ...  {'@context': 'http://schema.org', '@type': 'Vi...
2     http://schema.org  ...  {'@context': 'http://schema.org', '@type': 'Vi...
3     http://schema.org  ...                                                NaN
4     http://schema.org  ...  {'@context': 'http://schema.org', '@type': 'Vi...
                ...  ...                                                ...
6864  http://schema.org  ...  {'@context': 'http://schema.org', '@type': 'Vi...
6865  http://schema.org  ...  {'@context': 'http://schema.org', '@type': 'Vi...
6866  http://schema.org  ...  {'@context': 'http://schema.org', '@type': 'Vi...
6867  http://schema.org  ...  {'@context': 'http://schema.org', '@type': 'Vi...
6868  http://schema.org  ...  {'@context': 'http://schema.org', '@type': 'Vi...

[6869 rows x 15 columns]


print(df.columns)
Index(['@context', '@type', 'headline', 'description', 'articleBody',
       'articleSection', 'url', 'mainEntityOfPage', 'datePublished',
       'dateModified', 'author', 'publisher', 'image', 'keywords', 'video'],
      dtype='object')
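
If you want to keep the results around, here is a minimal follow-up sketch (the output file name is made up for illustration) that normalizes the dates and dumps everything to CSV:

# Rows that came from the data-timestamp fallback hold epoch values rather
# than ISO strings; unparseable entries become NaT via errors='coerce'
df['datePublished'] = pd.to_datetime(df['datePublished'], errors='coerce', utc=True)
df = df.sort_values('datePublished', ascending=False)
df.to_csv('caracol_colombia_news.csv', index=False)  # hypothetical file name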

Solution 2:[2]

Check the Query String format on Wikipedia, please.

You are missing a & mark.
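
For reference, a minimal sketch of how & separates key=value pairs in a query string (the first parameter name is taken from the URL above; the second one is made up for illustration):

from urllib.parse import urlencode

params = {
    '00000172-8578-d277-a9f3-f77bc3df0000-page': 2,  # the site's page parameter
    'foo': 'bar',                                    # hypothetical second parameter
}
# Each key=value pair is joined with '&' in the final query string
print('https://noticias.caracoltv.com/colombia?' + urlencode(params))
# https://noticias.caracoltv.com/colombia?00000172-8578-d277-a9f3-f77bc3df0000-page=2&foo=bar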

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   (not listed)
Solution 2   Jack Ting