'Decoding HTML entities with Python
I'm trying to decode HTML entries from here NYTimes.com and I cannot figure out what I am doing wrong.
Take for example:
"U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’"
I've tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str without any success.
Solution 1:[1]
>>> from HTMLParser import HTMLParser
>>> print HTMLParser().unescape('U.S. Adviser’s Blunt Memo on Iraq: '
... 'Time ‘to Go Home’')
U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’
The function is undocumented in Python 2. It is fixed in Python 3.4+: it is exposed as html.unescape() there.
Solution 2:[2]
Actually what you have are not HTML entities. There are THREE varieties of those &.....; thingies -- for example     all mean U+00A0 NO-BREAK SPACE.
  (the type you have) is a "numeric character reference" (decimal).  is a "numeric character reference" (hexadecimal). is an entity.
Further reading: http://htmlhelp.com/reference/html40/entities/
Here you will find code for Python2.x that does all three in one scan through the input: http://effbot.org/zone/re-sub.htm#unescape-html
Solution 3:[3]
This does work:
from BeautifulSoup import BeautifulStoneSoup
s = "U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’"
decoded = BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
If you want a string instead of a Unicode object, you'll need to decode it to an encoding that supports the characters being used; ISO-8859-1 doesn't:
result = decoded.encode("UTF-8")
It's unfortunate that you need an external module for something like this; simple HTML/XML entity decoding should be in the standard library, and not require me to use a library with meaningless class names like "BeautifulStoneSoup". (Class and function names should not be "creative", they should be meaningful.)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Florian |
| Solution 3 |
