Insert whitespace when stripping HTML tags using lxml
I want to insert whitespace into the resulting text when I strip tags and extract text using lxml.
I don't really know lxml. Via this answer (which seems based on a comment on the same page from @bluu), I have the following:
import lxml.html  # a bare "import lxml" does not load the html submodule
def strip_html(s):
    return str(lxml.html.fromstring(s).text_content())
When I try it with this:
strip_html("<p>This what you want.</p><p>This what you get.</p>")
I get this:
'This what you want.This what you get.'
But I want this:
'This what you want. This what you get.'
What I really want is the equivalent of this:
from bs4 import BeautifulSoup
s = "<p>This what you want.</p><p>This what you get.</p>"
BeautifulSoup(s, "lxml").get_text(separator=" ")
which does give the desired output, for all tags, but I want to avoid the amazing BeautifulSoup in this case.
I also want it to work for all tags, without my having to spell out each tag or loop and search for particular characters.
I have looked at the code of bs4's element.py to try to adapt the separator, and I see it's not a simple matter.
I was also looking at lxml.html.clean, as in this answer.
Solution 1:[1]
You could select all elements that contain text, iterate over them, and join() their text content with a separator:
s = "<p>This what you want.</p><p>This what you get.</p>"
' '.join([e.text_content() for e in lxml.html.document_fromstring(s).xpath("//*[text()]")])
Example
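import lxml.html  # a bare "import lxml" does not load the html submodule
def strip_html(s):
    # Select every element that has a direct text node and join their text with a space
    return ' '.join([e.text_content() for e in lxml.html.document_fromstring(s).xpath("//*[text()]")])
print(strip_html("<p>This what you want.</p><p>This what you get.</p>"))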
Output
This what you want. This what you get.
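One caveat with the XPath approach above: when an element with its own text also contains child elements with text (for example `<div>Outer <b>inner</b> tail</div>`), both the parent and the child match `//*[text()]`, so the child's text can show up twice in the result. A possible variation, sketched here as an assumption rather than as part of the original answer, is to walk the individual text nodes with `itertext()` so each one is emitted once. The `separator` parameter and the stripping of each piece are my own choices for illustration; they are not identical to how bs4's `get_text(separator=" ")` treats whitespace.

```python
import lxml.html


def strip_html(s, separator=" "):
    """Return the text of an HTML string with `separator` between text nodes."""
    # document_fromstring wraps the fragment in <html><body> if needed
    root = lxml.html.document_fromstring(s)
    # itertext() yields each text node once, in document order
    parts = (text.strip() for text in root.itertext())
    # Drop pieces that were only whitespace, then join the rest
    return separator.join(part for part in parts if part)


print(strip_html("<p>This what you want.</p><p>This what you get.</p>"))
print(strip_html("<div>Outer <b>inner</b> tail</div>"))
```

With the two inputs above, this should print `This what you want. This what you get.` and `Outer inner tail`. Note that, like `text_content()`, this includes the text of every element in the tree, so it makes no attempt to skip things such as `<script>` or `<style>` contents.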
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | HedgeHog |
