BeautifulSoup4 doesn't read &approx; as an HTML entity

This seems familiar; why does &approx; not get picked up by html.parser?

>>> from bs4 import BeautifulSoup
>>> for html in ['hey &lsquo; 3','hey &pi;','hey &approx; 3']:
...     print repr(unicode(BeautifulSoup(html,'html.parser')))
... 
u'hey \u2018 3'
u'hey \u03c0'
u'hey &amp;approx 3'


Solution 1:[1]

I figured it out myself by reading the bs4 source code for the html.parser tree builder.

BeautifulSoup's builder looks up entity names in the mapping bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER, which has no entry for approx, so it is easy to patch by adding one:

import bs4
from bs4 import BeautifulSoup

rawhtml = '<p>&lsquo; &approx; &pi; &theta; 3.</p>'

soup = BeautifulSoup(rawhtml, 'html.parser')
print('Before: %s' % repr(soup))

# &approx; -> \u2248
# from https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references    
bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER['approx'] = u'\u2248'

soup = BeautifulSoup(rawhtml, 'html.parser')
print('After: %s' % repr(soup))

This prints:

Before: <p>\u2018 &amp;approx \u03c0 \u03b8 3.</p>
After: <p>\u2018 \u2248 \u03c0 \u03b8 3.</p>
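
As for why approx is missing in the first place: a likely explanation, assuming the bs4 version in use fills its table from the HTML 4 entity list in the standard library (I have not checked this for every release), is that &approx; is only defined in HTML5, while HTML 4 names the same character &asymp;. The sketch below uses only the standard library's html.entities module to check this and to collect every HTML5-only name. The helper html5_only_entities is a hypothetical name made up for this example, not part of bs4.

```python
from html.entities import html5, name2codepoint


def html5_only_entities():
    """Return {name: character(s)} for HTML5 names absent from the HTML 4 table.

    Hypothetical helper for illustration; not part of bs4.
    """
    extra = {}
    for name, chars in html5.items():
        if not name.endswith(';'):
            continue  # skip legacy spellings that omit the semicolon
        bare = name[:-1]  # 'approx;' -> 'approx'
        if bare not in name2codepoint:
            extra[bare] = chars
    return extra


extra = html5_only_entities()

# The HTML 4 table knows 'asymp' but not 'approx'; HTML5 defines both.
print('approx' in name2codepoint)
print('asymp' in name2codepoint)
print(ascii(extra['approx']))
```

If the assumption holds for your bs4 version, passing this dictionary to HTML_ENTITY_TO_CHARACTER.update() should patch all of the missing names at once instead of one at a time. Treat that as a sketch to verify rather than a guaranteed fix.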

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jason S