Insert whitespace when stripping HTML tags using lxml

I want to insert whitespace into the resulting text when I strip tags and extract text using lxml.

I don't really know lxml. Via this answer (which seems based on a comment on the same page from @bluu), I have the following:

import lxml.html

def strip_html(s):
    return str(lxml.html.fromstring(s).text_content())

When I try it with this:

strip_html("<p>This what you want.</p><p>This what you get.</p>")

I get this:

'This what you want.This what you get.'

But I want this:

'This what you want. This what you get.'

What I really want is the equivalent of this:

from bs4 import BeautifulSoup

s = "<p>This what you want.</p><p>This what you get.</p>"

BeautifulSoup(s, "lxml").get_text(separator=" ")

This gives the desired output for all tags, but I want to avoid BeautifulSoup in this case, amazing as it is.

I also want it to work for all tags, without my having to spell out every tag or loop and search for particular characters.

I have looked at the code of bs4's element.py to try to adapt the separator, and I see it's not a simple matter.

I was also looking at lxml.html.clean, as in this answer.



Solution 1:[1]

You could select all elements that contain text, iterate over them, and join() their text content with a separator:

s = "<p>This what you want.</p><p>This what you get.</p>"
' '.join([e.text_content() for e in lxml.html.document_fromstring(s).xpath("//*[text()]")])
Example
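import lxml.html  # "import lxml" alone does not load the html submodule

def strip_html(s):
    # Join the text content of every element that has a text node of its own
    return ' '.join([e.text_content() for e in lxml.html.document_fromstring(s).xpath("//*[text()]")])

print(strip_html("<p>This what you want.</p><p>This what you get.</p>"))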
Output
This what you want. This what you get.
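
One caveat: with nested tags, "//*[text()]" combined with text_content() can repeat text. For "<p>Hello <b>world</b></p>", both the p and the b match, so "world" ends up in the result twice. A minimal sketch of one way around this, assuming you only need each text node once, is to walk the text nodes with itertext() and join the stripped pieces. The name strip_html_text and its separator parameter are made up for this illustration and are not part of lxml:

```python
import lxml.html


def strip_html_text(s, separator=" "):
    """Join every text node of an HTML fragment with `separator`.

    Hypothetical helper for illustration; not part of lxml.
    """
    root = lxml.html.fromstring(s)
    # itertext() yields each text node once, in document order,
    # so nested tags do not produce duplicated text.
    pieces = (text.strip() for text in root.itertext())
    # Drop whitespace-only nodes so they do not add extra separators.
    return separator.join(piece for piece in pieces if piece)


print(strip_html_text("<p>This what you want.</p><p>This what you get.</p>"))
# This what you want. This what you get.
print(strip_html_text("<p>Hello <b>world</b></p>"))
# Hello world
```

Because each text node is joined with the separator, punctuation that directly follows an inline tag (for example "<b>world</b>!") comes out preceded by a space, so this may need adjusting if exact spacing matters.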

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 HedgeHog