BeautifulSoup4 doesn't read &approx; as an HTML entity
This seems familiar; why does &approx; not get picked up by html.parser?
>>> from bs4 import BeautifulSoup
>>> for html in ['hey &lsquo; 3','hey &pi;','hey &approx; 3']:
... print repr(unicode(BeautifulSoup(html,'html.parser')))
...
u'hey \u2018 3'
u'hey \u03c0'
u'hey &approx 3'
Solution 1:[1]
I figured it out myself by reading the bs4 source code for the html.parser builder.
The builder looks up entity names in the mapping bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER, and approx is missing from it, so the fix is to add that entry:
import bs4
from bs4 import BeautifulSoup
rawhtml = '<p>&lsquo; &approx; &pi; &theta; 3.</p>'
soup = BeautifulSoup(rawhtml, 'html.parser')
print('Before: %s' % repr(soup))
# &approx; -> \u2248
# from https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER['approx'] = u'\u2248'
soup = BeautifulSoup(rawhtml, 'html.parser')
print('After: %s' % repr(soup))
This prints:
Before: <p>\u2018 &approx \u03c0 \u03b8 3.</p>
After: <p>\u2018 \u2248 \u03c0 \u03b8 3.</p>
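The same patch can be generalized instead of adding names one at a time. The following is only a sketch, written for Python 3 and using nothing but the standard library's html.entities module. The helper missing_entities and the stand-in table html4_table are hypothetical names introduced for illustration, and it is an assumption that the parser's table was derived from the HTML 4 names, which is what the stand-in models.

```python
from html.entities import html5, name2codepoint


def missing_entities(known):
    """Return {name: character} for HTML5 entity names absent from `known`.

    `known` is any mapping keyed by bare entity names (no '&', no ';').
    """
    extra = {}
    for name, char in html5.items():
        if not name.endswith(';'):
            continue  # skip the legacy spellings that omit the semicolon
        bare = name[:-1]
        if bare not in known:
            extra[bare] = char
    return extra


# Hypothetical stand-in for a parser's name-to-character table,
# built from the HTML 4.01 entity names only.
html4_table = {name: chr(cp) for name, cp in name2codepoint.items()}

extra = missing_entities(html4_table)
print(repr(extra['approx']))
```

With bs4 installed, the resulting dictionary could presumably be merged into bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER with update() before parsing. Newer bs4 releases may already recognize these names, so it is worth checking whether the key is present first.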
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jason S |
