Python Pandas DataFrame: One column contains special HTML characters such as &amp; and &lt;. Is there a way to remove them?
Solution 1:[1]
I found this gist, https://gist.github.com/codeboy/5487eeb1c551d59e2366, which does slightly more than you're asking, so I modified it to this:
    import re

    def parse_text(text, patterns=None):
        """
        Modified from the GitHub gist linked above.
        Delete all HTML entities matched by the patterns.
        :param text (str): given text
        :param patterns (dict): regex pattern -> replacement, passed to re.sub
        :return str: final text
        """
        # Each key is a regex; an empty replacement deletes the match.
        base_patterns = {"&[rl]dquo;": "",
                         "&[rl]squo;": "",
                         "&nbsp;": "",
                         "&amp;": ""}
        patterns = patterns or base_patterns
        final_text = text
        for pattern, repl in patterns.items():
            final_text = re.sub(pattern, repl, final_text)
        return final_text
You can call it like this, assigning the result to a new column so you can compare it with the original string:
    df["column3"] = df["column2"].apply(parse_text)
Please note that the default patterns are probably not complete, and you may have to extend them based on which entities appear in your escaped HTML.
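If listing every entity by hand gets tedious, one possible alternative is a single generic pattern. The sketch below is not part of the original answer: `ENTITY_RE` and `strip_entities` are names made up for illustration, and the regex is an assumption about what counts as an entity (named, decimal, or hexadecimal references that end in a semicolon). It also shows the standard library's `html.unescape`, in case you would rather decode the entities back to characters than delete them.

```python
import html
import re

# Assumed pattern (not from the original answer): matches named entities such
# as &amp;, decimal references such as &#38;, and hex references such as &#x26;.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def strip_entities(text):
    """Delete every HTML entity or character reference from text."""
    return ENTITY_RE.sub("", text)


sample = "Tom &amp; Jerry &lt;3 &#8220;cartoons&#8221;"

# Option 1: remove the entities entirely.
stripped = strip_entities(sample)

# Option 2: decode the entities back into the characters they stand for.
decoded = html.unescape(sample)

print(stripped)
print(decoded)
```

It should plug into the DataFrame the same way as `parse_text`, for example `df["column3"] = df["column2"].apply(strip_entities)`. A bare `&` that is not followed by a name and a semicolon is left untouched.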
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mechanical_meat |


