How to webscrape with Python keeping meta-information in the text?

I am trying to webscrape this website. To do so, I run the following code:

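from bs4 import BeautifulSoup
import requests

url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning.
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = []

# Follow every link listed on the index page.
for link in soup.select('div.title > a'):
    # Use a separate name so the index page's soup is not overwritten.
    page = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{link['href']}").content, "html.parser")
    data.append({
        # Joining with a space merges the paragraphs; .text drops the bold tags.
        'text': ' '.join([p.text for p in page.select('main .section p:not([class])')])
    })

print(data)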

This works fine. The problem is that the scraped text carries no information about paragraph breaks or bold characters. That is an issue because I later need to make some decisions based on that formatting.

Can anyone suggest how to keep this meta-information in the text?

Thanks a lot!



Solution 1:[1]

One solution is to identify, in the page's HTML source, which markers denote paragraph breaks and bold characters.

Then, in the "soup" variable, you can locate the parts that interest you by searching for those markers as strings.

Looking briefly at the source code of your website, I think the answer lies in the following markers (I had to add ' characters, otherwise Stack Overflow hides the markers):

"<'/a><'/div><'div class="subtitle">"

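As a rough illustration of the idea, here is a minimal sketch that keeps both kinds of information when extracting the text. It assumes that paragraphs are <p> elements and that bold text sits in <b> or <strong> tags, which is the usual HTML convention but which I have not verified for every page of that site. The SAMPLE_HTML string, the extract_with_markup function and the ** wrapper are made up for the example; the selector is the one from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for one downloaded page (not real site content).
SAMPLE_HTML = """
<main><div class="section">
<p>First paragraph with <strong>bold words</strong> inside.</p>
<p class="note">Skipped because it has a class attribute.</p>
<p>Second paragraph, <b>also bold</b>.</p>
</div></main>
"""


def extract_with_markup(html):
    """Return the text with paragraphs separated by blank lines and bold runs wrapped in **."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for p in soup.select("main .section p:not([class])"):
        # Assumption: bold text is marked with <b> or <strong>.
        # Replace each bold tag with its text wrapped in ** so the emphasis survives.
        for bold in p.find_all(["b", "strong"]):
            bold.replace_with(f"**{bold.get_text()}**")
        paragraphs.append(p.get_text().strip())
    # One blank line between paragraphs keeps the paragraph split visible.
    return "\n\n".join(paragraphs)


print(extract_with_markup(SAMPLE_HTML))
```

In the loop from the question, the same function could be applied to requests.get(...).content for each link instead of the ' '.join(...) expression. If you would rather keep the raw tags than a ** marker, p.decode_contents() returns the inner HTML of each paragraph as a string.
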
Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 totalMongot