Issue parsing a Wiki.js webpage's HTML content with BeautifulSoup

I am using the BeautifulSoup Python module to parse the HTML content of a Wiki.js-based webpage. However, I am having trouble extracting the text of the heading and paragraph tags.

I have tried both the .getText() method and the .text property, but neither returned the text inside the heading/paragraph tags.

Below is the code snippet for reference:

import requests
from bs4 import BeautifulSoup

# a sample webpage built using Wiki.js
url = "https://brgswiki.org/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

heading_tags = ["h1","h2"]

for tag in soup.find_all(heading_tags):
    print("=============================================")
    print(f"complete Header Tag with the text:\n{tag}")
    print("=============================================")
    print("just header tag_name and header text_content")
    print(tag.name + ' -> ' + tag.text.strip())

And here's the output:

=============================================
complete Header Tag with the text:
<h2 class="toc-header" id="subscribe-to-our-new-newsletter"><a class="toc-anchor" href="#subscribe-to-our-new-newsletter">¶</a> <em>Subscribe to our new newsletter!</em></h2>
=============================================
just header tag_name and header text_content
h2 ->

As the output shows, the h2 tag's text, "Subscribe to our new newsletter!", is not extracted, even though it is present in the printed tag.

I see this issue only with webpages built on Wiki.js; other webpages work fine.

Any suggestions or guidance on how to work around this issue would be appreciated.

Thank you.
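
One possible cause, offered as an assumption rather than a confirmed diagnosis: Wiki.js pages may wrap their rendered content in a <template> element. Since version 4.9.0, Beautiful Soup (with html.parser or lxml) stores strings found inside <template> as TemplateString objects, and .text / get_text() skip those by default. The sketch below reproduces this on a small inline HTML string (made up for illustration and modelled on the h2 in the output above, not fetched from the real site) and passes the types argument of get_text() so those strings are included. heading_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, TemplateString

# Illustrative HTML only: assumes the page wraps its content in <template>.
html = (
    '<template slot="contents"><div>'
    '<h2 class="toc-header" id="subscribe-to-our-new-newsletter">'
    '<a class="toc-anchor" href="#subscribe-to-our-new-newsletter">¶</a> '
    '<em>Subscribe to our new newsletter!</em></h2>'
    '</div></template>'
)

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")


def heading_text(tag):
    """Return the tag's text, also counting strings parsed inside <template>."""
    return tag.get_text(types=(NavigableString, TemplateString)).strip()


# Default behaviour: strings inside <template> are not treated as text.
default_text = h2.get_text().strip()

# Explicitly include TemplateString alongside ordinary strings.
full_text = heading_text(h2)

print("default:", repr(default_text))
print("with types:", repr(full_text))
```

If this matches the real page, heading_text(tags) could stand in for tags.text.strip() in the loop above. The leading ¶ comes from the anchor link; calling the helper on the tag's em child instead would narrow the result to the title alone.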



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
