Replacing Unicode Characters with actual symbols
string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
I want to get rid of the <U+2019> and replace it with '. Is there a way to do this in Python?
Edit: I also have instances of <U+2014>, <U+201C>, etc. I am looking for something that can replace all of these with the appropriate characters.
Solution 1:[1]
I guess this solves the problem if it's just one or two of these characters.
>>> string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
>>> string.replace("<U+2019>","'")
"At Donald Trump's Properties, a Showcase for a Brand and a President-Elect"
If there are many of these substitutions to be done, consider using the 'map()' function.
Source: Removing \u2018 and \u2019 character
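When only a handful of known placeholders occur, one way to apply them all is a lookup table walked in a loop. This is a minimal sketch, assuming the placeholders appear exactly in the form `<U+XXXX>`; the `REPLACEMENTS` table and the `replace_known` helper are hypothetical names introduced here, not part of the original answer:

```python
# Hypothetical lookup table: placeholder text -> the character it stands for.
REPLACEMENTS = {
    "<U+2019>": "\u2019",  # right single quotation mark
    "<U+2014>": "\u2014",  # em dash
    "<U+201C>": "\u201c",  # left double quotation mark
    "<U+201D>": "\u201d",  # right double quotation mark
}


def replace_known(text):
    """Replace every placeholder listed in REPLACEMENTS; leave others untouched."""
    for placeholder, char in REPLACEMENTS.items():
        text = text.replace(placeholder, char)
    return text


string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President<U+2014>Elect"
print(replace_known(string))
```

If plain ASCII is preferred, the values in the table can be swapped for stand-ins such as "'" or "-". Placeholders that are not in the table are left as they are, so this only suits a small, known set.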
Solution 2:[2]
You can replace using .replace()
print(string.replace('<U+2019>', "'"))
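Or, if the code point varies, you can use re to find the placeholder first (there is surely a neater pattern than mine). Note that this only finds the first placeholder and always substitutes an apostrophe.
import re
string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
# Match "<U+" followed by four hexadecimal digits and ">"
rep = re.search(r'<U\+[0-9A-Fa-f]{4}>', string).group()
print(string.replace(rep, "'"))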
Solution 3:[3]
Here's my solution for all code points denoted as U+0000 through U+10FFFF ("U+" followed by the code point value in hexadecimal, which is prepended with leading zeros to a minimum of four digits):
import re

def UniToChar(unicode_notation):
    # Extract the hex digits from "<U+XXXX>" and convert them to a character
    return chr(int(re.findall(r'<U\+([a-fA-F0-9]{4,5})>', unicode_notation)[0], 16))

xx = '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
'''
for x in xx.split('\n'):
    abc = re.findall(r'<U\+[a-fA-F0-9]{4,5}>', x)
    if len(abc) > 0:
        for uniid in set(abc):
            x = x.replace(uniid, UniToChar(uniid))
    print(repr(x).strip("'"))
Output of 71307293.py:
At Donald’s ‖Elect‖ in ‗2019‗
À la Donald’s friend 🦆. 🤩🤪😁
In fact, the private-use range from U+100000 to U+10FFFD (Plane 16) isn't matched by the simplified regex above, because those code points need six hexadecimal digits. Improved code follows:
import re

def UniToChar(unicode_notation):
    aux = int(re.findall(r'<U\+([a-fA-F0-9]{4,6})>', unicode_notation)[0], 16)
    # circumvent the "ValueError: chr() arg not in range(0x110000)"
    if aux <= 0x10FFFD:
        return chr(aux)
    else:
        return chr(0xFFFD)  # Replacement Character

xx = '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
Unassigned: <U+05ff>; out of Unicode range: <U+110000>.
'''
for x in xx.split('\n'):
    abc = re.findall(r'<U\+[a-fA-F0-9]{4,6}>', x)
    if len(abc) > 0:
        for uniid in set(abc):
            x = x.replace(uniid, UniToChar(uniid))
    print(repr(x).strip("'"))
Output: 71307293.py
At Donald’s ?Elect? in ?2019? À la Donald’s friend ?. ??? Unassigned: \u05ff; out of Unicode range: ?.
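The loop above can also be collapsed into a single pass with `re.sub` and a callback, which avoids splitting the text into lines and replacing each placeholder separately. This is a sketch under stated assumptions, not the answer's own code: `PLACEHOLDER` and `decode_placeholders` are names introduced here for illustration, and any value that `chr()` cannot represent (above U+10FFFF) is mapped to U+FFFD.

```python
import re

# "<U+" followed by 4 to 6 hexadecimal digits and ">".
PLACEHOLDER = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def decode_placeholders(text):
    """Replace each <U+XXXX> placeholder with the character it names."""
    def to_char(match):
        code_point = int(match.group(1), 16)
        # chr() raises ValueError above 0x10FFFF, so fall back to U+FFFD.
        return chr(code_point) if code_point <= 0x10FFFF else "\ufffd"
    return PLACEHOLDER.sub(to_char, text)


sample = "At Donald<U+2019>s friend <U+1F986>; out of range: <U+110000>."
# ascii() escapes non-ASCII characters so the result prints on any console.
print(ascii(decode_placeholders(sample)))
```

Because `re.sub` scans the string once and calls `to_char` for every match, repeated placeholders and multi-line input need no special handling.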
Solution 4:[4]
What version of Python are you using?
I edited my answer so it can be used with multiple code points in the same string.
You need to convert the Unicode code point between < and > into the corresponding Unicode character.
I used a regex to extract the code point and then converted it to the corresponding Unicode character:
import re
string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President<U+2014>Elect"
# Four hexadecimal digits; \d alone would miss code points such as <U+201C>
pattern = r'<U\+[0-9A-Fa-f]{4}>'
repbool = re.search(pattern, string)
while repbool:
    rep = repbool.group()
    # Strip "<U+" and ">" and convert the hex digits to a character
    string = string.replace(rep, chr(int(rep[3:-1], 16)))
    repbool = re.search(pattern, string)
print(string)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | CrYbAbY |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | |
