Extract everything inside tag, but not tag itself

I'm using BeautifulSoup to scrape text from a website, but I only want the <p> tags for organization. However, I can't use text.findAll('p'), because there are other <p> tags that I don't want.

The text I want is all wrapped inside one tag (let's say body), but when I parse it, the result also includes that tag.

link = requests.get('link')
text = bs4.BeautifulSoup(link.text, 'html.parser').find('body')

How would I remove the body tag?



Solution 1:[1]

text = bs4.BeautifulSoup(link.text, 'html.parser').find('body').text

This will concatenate all the text in the body tag.
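If the concatenated text runs together, get_text (which .text calls with no arguments) accepts separator and strip arguments. Here is a minimal sketch; the html string is a made-up stand-in for link.text:

```python
import bs4

# Hypothetical sample markup standing in for the fetched page (link.text)
html = "<body><p>Hello <b>World</b></p><p>Hello again</p></body>"

body = bs4.BeautifulSoup(html, "html.parser").find("body")

# .text is the same as get_text() with no arguments: fragments are joined as-is
plain = body.get_text()

# separator goes between text fragments; strip=True trims each fragment
# and drops whitespace-only ones
lines = body.get_text(separator="\n", strip=True)

print(repr(plain))   # 'Hello WorldHello again'
print(repr(lines))   # 'Hello\nWorld\nHello again'
```

Note that the separator is placed between every text fragment, so inline tags such as <b> also cause a break.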

Solution 2:[2]

If you want everything in the tag (including HTML), but not the tag itself, you can use the decode_contents method of the Tag class. It renders the contents of the tag, without the enclosing tag, as a Unicode string.

>>> html = """
<body>
<p>Hello <b>World</b></p>
<p>Hello again</p>
</body>
"""

>>> body = bs4.BeautifulSoup(html, 'html.parser').find('body')

>>> body.decode_contents()
'\n<p>Hello <b>World</b></p>\n<p>Hello again</p>\n'

The question is a little ambiguous, so that may not be exactly what you're asking for. Here are some similar options that you or others may be looking for:

>>> body.text
'\nHello World\nHello again\n'

>>> str(body)
'<body>\n<p>Hello <b>World</b></p>\n<p>Hello again</p>\n</body>'

>>> body.contents
['\n', <p>Hello <b>World</b></p>, '\n', <p>Hello again</p>, '\n']

>>> [p.text for p in body.find_all('p')]
['Hello World', 'Hello again']

>>> list(body.strings)
['\n', 'Hello ', 'World', '\n', 'Hello again', '\n']
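For the situation in the question, where only the <p> tags inside one container are wanted, the same calls can be scoped to that container. The following is a sketch under assumptions: the markup and the "content" and "sidebar" ids are invented for illustration, so substitute whatever identifies the wrapper on the real page.

```python
import bs4

# Hypothetical page: one <p> we don't want, two inside the container we do
html = """
<div id="sidebar"><p>Unwanted</p></div>
<div id="content">
<p>Hello <b>World</b></p>
<p>Hello again</p>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Find the wrapper first, then search only inside it
container = soup.find("div", id="content")
wanted = container.find_all("p")

# Each <p> rendered as HTML, and as plain text; the wrapper is never included
paragraph_html = [str(p) for p in wanted]
paragraph_text = [p.get_text() for p in wanted]

print(paragraph_html)  # ['<p>Hello <b>World</b></p>', '<p>Hello again</p>']
print(paragraph_text)  # ['Hello World', 'Hello again']
```

Calling find_all on the container rather than on the whole soup is what keeps the sidebar paragraph out of the result.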

Solution 3:[3]

This may help you:

>>> txt = """\
<p>Rahul</p>
<p><i>White</i></p>
<p>City <b>Beston</b></p>
"""

>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))

Rahul
White
City Beston

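Or, to keep only the text inside the body tag:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
bodyTag = soup.find('body')
# bodyTag is already a parsed Tag, so there is no need to parse it again
print("".join(bodyTag.strings))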
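Since .strings also yields the whitespace-only fragments between tags, stripped_strings may be handier when only the visible text matters. Here is a small sketch that reuses the txt sample from this answer:

```python
from bs4 import BeautifulSoup

txt = """\
<p>Rahul</p>
<p><i>White</i></p>
<p>City <b>Beston</b></p>
"""

soup = BeautifulSoup(txt, "html.parser")

# .strings keeps every text node, including the newlines between tags
raw = list(soup.strings)

# .stripped_strings trims each text node and skips the empty ones
clean = list(soup.stripped_strings)

print(clean)  # ['Rahul', 'White', 'City', 'Beston']
```

Each fragment is trimmed on its own, so "City " and "Beston" come back as separate items. Join them with a space if the original wording needs to be rebuilt.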

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 宏杰李
Solution 2 Nala Nkadi
Solution 3 Piyush S. Wanare