href inside "Load more" button doesn't bring more articles when pasting URL
I'm trying to scrape this site:
https://noticias.caracoltv.com/colombia
At the end there is a "Cargar Más" button that loads more news. So far so good. But when inspecting that element, I can see it loads a link like this: https://noticias.caracoltv.com/colombia?00000172-8578-d277-a9f3-f77bc3df0000-page=2
The problem is that if I enter this URL into my browser, I get the same news as when I load the original page. Because of this, the only way I can see to scrape the site is a script that repeatedly clicks the button. But since I need news going back to 2019, that doesn't seem very feasible.
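For reference, the paged URLs themselves are easy to generate, since they only differ in the page number (I'm assuming the long token stays the same across pages); the hard part is that they don't seem to return different content in the browser:

```python
# Build the URLs the "Cargar Más" button requests; the long token is
# copied from the inspected href and assumed to be stable.
BASE = "https://noticias.caracoltv.com/colombia"
TOKEN = "00000172-8578-d277-a9f3-f77bc3df0000"

def page_url(page):
    """Return the listing URL for a given 'Cargar Más' page number."""
    return f"{BASE}?{TOKEN}-page={page}"

print(page_url(2))
# https://noticias.caracoltv.com/colombia?00000172-8578-d277-a9f3-f77bc3df0000-page=2
```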
Also, I checked the button's event listeners in the developer tools, but I'm not sure how I can use them to my advantage.
Am I missing something? Is there a way to access older news through a link? (An API would be even better, but I didn't find any calls to one.)
I'm currently using Python to scrape, but I'm still in the investigation stage, so I don't have any meaningful code to show. Thanks a lot!
Solution 1:[1]
Unfortunately, there doesn't seem to be a nice API or other way to get the data in bulk. You will need to iterate through the "pages" via the "load more" URL and parse the HTML of each one.
It will take a little time, but this code will do that for you.
import requests
from bs4 import BeautifulSoup
import re
import json
from dateutil import parser
import datetime
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

rows = []
endYear = 2019
continueLoop = True
page = 1
while continueLoop:
    url = f'https://noticias.caracoltv.com/colombia?00000172-8578-d277-a9f3-f77bc3df0000-page={page}'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # The lead article and the "ListE" sub-articles only appear on page 1
    if page == 1:
        mainArticle = soup.find('div', {'class': 'ListU'})
        jsonStr = str(mainArticle.find('script'))
        jsonStr = re.search('({.*})', jsonStr).group(1)
        jsonData = json.loads(jsonStr)
        articlePublished = jsonData['datePublished']
        dt = parser.parse(articlePublished)
        print(f"{dt.day}-{dt.month}-{dt.year} : {jsonData['headline']}")
        rows.append(jsonData)

        subArticles = soup.find_all('li', {'class': 'ListE-items-item'})
        for subArticle in subArticles:
            jsonStr = str(subArticle.find('script'))
            jsonStr = re.search('({.*})', jsonStr).group(1)
            jsonData = json.loads(jsonStr)
            articlePublished = jsonData['datePublished']
            dt = parser.parse(articlePublished)
            print(f"{dt.day}-{dt.month}-{dt.year} : {jsonData['headline']}")
            rows.append(jsonData)

    # Get the bottom ("Cargar Más") articles
    loadMore = soup.find('ps-list-loadmore', {'class': 'ListD'})
    articles = loadMore.find_all('li', {'class': 'ListD-items-item'})
    for article in articles:
        if continueLoop:
            try:
                # Most items embed their metadata as JSON-LD in a <script> tag
                jsonStr = str(article.find('script'))
                jsonStr = re.search('({.*})', jsonStr).group(1)
                jsonData = json.loads(jsonStr)
                articlePublished = jsonData['datePublished']
                dt = parser.parse(articlePublished)
            except Exception:
                # Some items have no JSON-LD <script>; fall back to the visible
                # markup (data-timestamp appears to be epoch milliseconds)
                headline = article.find('a', href=True)['title']
                articlePublished = article.find('div', {'class': 'PromoB-timestamp'})['data-timestamp']
                url = article.find('a', href=True)['href']
                dt = datetime.datetime.fromtimestamp(int(articlePublished) / 1000)
                jsonData = {
                    'headline': headline,
                    'datePublished': articlePublished,
                    'url': url}
            if endYear == dt.year:
                print('Done')
                continueLoop = False
            else:
                print(f"{dt.day}-{dt.month}-{dt.year} : {jsonData['headline']}")
                rows.append(jsonData)
    page += 1

df = pd.DataFrame(rows)
Output:
print(df)
@context ... video
0 http://schema.org ... {'@context': 'http://schema.org', '@type': 'Vi...
1 http://schema.org ... {'@context': 'http://schema.org', '@type': 'Vi...
2 http://schema.org ... {'@context': 'http://schema.org', '@type': 'Vi...
3 http://schema.org ... NaN
4 http://schema.org ... {'@context': 'http://schema.org', '@type': 'Vi...
... ... ...
6864 http://schema.org ... {'@context': 'http://schema.org', '@type': 'Vi...
6865 http://schema.org ... {'@context': 'http://schema.org', '@type': 'Vi...
6866 http://schema.org ... {'@context': 'http://schema.org', '@type': 'Vi...
6867 http://schema.org ... {'@context': 'http://schema.org', '@type': 'Vi...
6868 http://schema.org ... {'@context': 'http://schema.org', '@type': 'Vi...
[6869 rows x 15 columns]
print(df.columns)
Index(['@context', '@type', 'headline', 'description', 'articleBody',
'articleSection', 'url', 'mainEntityOfPage', 'datePublished',
'dateModified', 'author', 'publisher', 'image', 'keywords', 'video'],
dtype='object')
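To illustrate the parsing trick in isolation: each list item embeds its metadata as JSON-LD inside a <script> tag, and the regex simply grabs everything between the outermost braces. A minimal sketch on a made-up snippet (the real markup has more fields and attributes), including how you might then filter the resulting DataFrame by year:

```python
import json
import re

from bs4 import BeautifulSoup
import pandas as pd

# Made-up list item mimicking the JSON-LD <script> tags the scraper parses.
html = """
<li class="ListD-items-item">
  <script type="application/ld+json">
    {"headline": "Titular de ejemplo",
     "datePublished": "2020-05-04T12:00:00-05:00",
     "url": "https://noticias.caracoltv.com/colombia/ejemplo"}
  </script>
</li>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.find_all("li", {"class": "ListD-items-item"}):
    # Grab everything between the outermost braces and decode it as JSON
    # (re.DOTALL because this sample JSON spans several lines).
    jsonStr = re.search(r"({.*})", str(item.find("script")), re.DOTALL).group(1)
    rows.append(json.loads(jsonStr))

df = pd.DataFrame(rows)
# Parse the ISO timestamps, then keep only articles from 2019 onwards.
df["datePublished"] = pd.to_datetime(df["datePublished"], utc=True)
print(df.loc[df["datePublished"].dt.year >= 2019, "headline"].tolist())
# ['Titular de ejemplo']
```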
Solution 2:[2]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Jack Ting |


