Issue with parsing a wiki.js webpage's HTML content using BeautifulSoup
I am using the BeautifulSoup Python module to parse the HTML content of a wiki.js-based webpage. However, I am having trouble extracting the text of the header and paragraph tags.
I have tried the .getText() method and the .text property, but neither returned the text from the header/paragraph tags.
Below is the code snippet for reference:
import requests
from bs4 import BeautifulSoup
# a random webpage built using wiki.js
url = "https://brgswiki.org/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
heading_tags = ["h1","h2"]
for tags in soup.find_all(heading_tags):
    print("=============================================")
    print(f"complete Header Tag with the text:\n{tags}")
    print("=============================================")
    print("just header tag_name and header text_content")
    print(tags.name + ' -> ' + tags.text.strip())
And here's the output:
=============================================
complete Header Tag with the text:
<h2 class="toc-header" id="subscribe-to-our-new-newsletter"><a class="toc-anchor" href="#subscribe-to-our-new-newsletter">¶</a> <em>Subscribe to our new newsletter!</em></h2>
=============================================
just header tag_name and header text_content
h2 ->
As you can see in the output, the h2 tag's text, "Subscribe to our new newsletter!", is not extracted.
I see this issue only with webpages built on wiki.js; other webpages work just fine.
Any suggestions or guidance on how to get around this issue would be appreciated.
Thank you.
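One possible explanation, offered as an assumption rather than a confirmed diagnosis: wiki.js may serve the page body inside a `<template>` element, and recent Beautiful Soup versions appear to store strings found there in a special string class that `.text` and `.get_text()` skip by default. The sketch below uses a made-up `SAMPLE_HTML` string (a stand-in for the real page, so no network request is needed) and a hypothetical helper `heading_text` that collects every string under a tag with `find_all(string=True)`, regardless of its string class.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the markup a wiki.js page might serve.
# Assumption: the rendered content is wrapped in a <template> element.
SAMPLE_HTML = """
<div id="root">
  <template slot="contents">
    <h2 class="toc-header" id="subscribe-to-our-new-newsletter"><a class="toc-anchor" href="#subscribe-to-our-new-newsletter">¶</a> <em>Subscribe to our new newsletter!</em></h2>
  </template>
</div>
"""


def heading_text(tag):
    """Join every string nested under `tag`, whatever string class it has.

    find_all(string=True) matches all string nodes under the tag, so it
    does not depend on the default type filter applied by .text.
    """
    return "".join(tag.find_all(string=True)).strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect (tag name, text) pairs for each heading, then print them.
headings = [(tag.name, heading_text(tag)) for tag in soup.find_all(["h1", "h2"])]
for name, text in headings:
    print(f"{name} -> {text}")
```

If this is indeed the cause, the same helper could replace `tags.text.strip()` in the loop above. On Beautiful Soup versions that define `TemplateString` in `bs4.element`, passing `types=(NavigableString, TemplateString)` to `get_text()` should have a similar effect, though that has not been checked against the page in question.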
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
