Replacing Unicode Characters with actual symbols

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"

I want to get rid of the <U+2019> and replace it with '. Is there a way to do this in Python?

Edit: I also have instances of <U+2014>, <U+201C>, etc. I'm looking for something that can replace all of these with the appropriate characters.

python unicode

Solution 1:^[1]

I guess this solves the problem if it's just one or two of these characters.

>>> string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
>>> string.replace("<U+2019>","'")
"At Donald Trump's Properties, a Showcase for a Brand and a President-Elect"

If there are many of these substitutions to be done, consider a mapping-based approach, such as a dictionary that maps each token to its replacement.
Source: Removing \u2018 and \u2019 character
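
For a handful of known tokens, the substitutions can be driven from a dictionary. This is only a minimal sketch: the REPLACEMENTS table and the replace_tokens helper are illustrative names that do not come from the original answer, and the ASCII stand-ins are just one possible choice.

```python
# Hypothetical lookup table: literal "<U+XXXX>" token -> desired replacement.
REPLACEMENTS = {
    "<U+2019>": "'",   # right single quotation mark
    "<U+2014>": "-",   # em dash, approximated here with a hyphen
    "<U+201C>": '"',   # left double quotation mark
    "<U+201D>": '"',   # right double quotation mark
}

def replace_tokens(text, table=REPLACEMENTS):
    """Replace every known token in text with its mapped value."""
    for token, value in table.items():
        text = text.replace(token, value)
    return text

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President<U+2014>Elect"
print(replace_tokens(string))
```

Tokens that are not in the table are left untouched, so missing entries are easy to spot in the result.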

Solution 2:^[2]

You can replace using .replace()

print(string.replace('<U+2019>', "'"))

Or, if the code point varies, you can use re to find the token first. This version is deliberately basic and can be tidied up.

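import re

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"

# Find the first <U+XXXX> token; raw string, and hex digits rather than \d
rep = re.search(r'<U\+[0-9A-Fa-f]{4}>', string).group()

# Note: every occurrence of that token becomes an apostrophe
print(string.replace(rep, "'"))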

Solution 3:^[3]

Here's my solution for all code points denoted as U+0000 through U+10FFFF ("U+" followed by the code point value in hexadecimal, which is prepended with leading zeros to a minimum of four digits):

import re

def UniToChar(unicode_notation):
    # Extract the hex digits from "<U+XXXX>" and convert them to a character
    return chr(int(re.findall(r'<U\+([a-fA-F0-9]{4,5})>', unicode_notation)[0], 16))

xx = '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
'''
for x in xx.split('\n'):
    # Hex digits are 0-9 and A-F only
    abc = re.findall(r'<U\+[a-fA-F0-9]{4,5}>', x)
    if len(abc) > 0:
        for uniid in set(abc):
            x = x.replace(uniid, UniToChar(uniid))

    print(repr(x).strip("'"))

Output: 71307293.py

At Donald’s ‖Elect‖ in ‗2019‗
À la Donald’s friend 🦆. 🤩🤪😁

In fact, the private-use range from U+100000 to U+10FFFD (Plane 16) isn't matched by the simplified regex above, because those code points need six hexadecimal digits. Improved code follows:

import re

def UniToChar(unicode_notation):
    # Extract the hex digits from "<U+XXXX>" and parse them as an integer
    aux = int(re.findall(r'<U\+([a-fA-F0-9]{4,6})>', unicode_notation)[0], 16)
    # circumvent the "ValueError: chr() arg not in range(0x110000)"
    if aux <= 0x10FFFD:
        return chr(aux)
    else:
        return chr(0xFFFD) # Replacement Character

xx = '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
Unassigned: <U+05ff>; out of Unicode range: <U+110000>.
'''
for x in xx.split('\n'):
    # Hex digits are 0-9 and A-F only; up to six of them
    abc = re.findall(r'<U\+[a-fA-F0-9]{4,6}>', x)
    if len(abc) > 0:
        for uniid in set(abc):
            x = x.replace(uniid, UniToChar(uniid))

    print(repr(x).strip("'"))

Output: 71307293.py

At Donald’s ?Elect? in ?2019?
À la Donald’s friend ?. ???
Unassigned: \u05ff; out of Unicode range: ?.
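
To check what a given code point actually is before substituting it, the standard-library unicodedata module can look up its name. A small sketch follows; describe is an illustrative helper and not part of the answer. A code point with no name, such as an unassigned one, falls back to the default text. Unassigned code points are also treated as non-printable, which is presumably why repr() shows U+05FF as an escape in the output above.

```python
import unicodedata

def describe(code_point):
    """Return the Unicode name of a code point, or a note if it has none."""
    return unicodedata.name(chr(code_point), "<no name: unassigned or unnamed>")

# Code points taken from the sample text in this answer.
for cp in (0x2019, 0x2016, 0x2017, 0x00C0, 0x05FF):
    print(f"U+{cp:04X}: {describe(cp)}")
```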

Solution 4:^[4]

Which version of Python are you using?

I edited my answer so it can be used with multiple code points in the same string.

You need to convert the Unicode code point between < and > to the corresponding Unicode character.

I used a regex to find each code point and then convert it to the corresponding Unicode character.

import re

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President<U+2014>Elect"

# Raw string; hex digits (not just \d) so code points like 201C also match
pattern = r'<U\+[0-9A-Fa-f]{4}>'

repbool = re.search(pattern, string)

while repbool:
  rep = repbool.group()
  # Strip "<U+" and ">" and convert the hex code point to its character
  string = string.replace(rep, chr(int(rep[3:-1], 16)))
  repbool = re.search(pattern, string)

print(string)
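
The loop above searches the string again after every replacement. As an alternative sketch (not the answerer's code), re.sub accepts a function as the replacement, so all tokens can be converted in a single pass. The names TOKEN and decode_tokens are illustrative, and the fallback to U+FFFD for out-of-range values is an assumption borrowed from Solution 3.

```python
import re

# Matches "<U+XXXX>" with 4 to 6 hexadecimal digits; the digits are captured.
TOKEN = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")

def decode_tokens(text):
    """Replace each <U+XXXX> token with the character it denotes."""
    def to_char(match):
        code_point = int(match.group(1), 16)
        # chr() raises ValueError above 0x10FFFF; fall back to U+FFFD.
        return chr(code_point) if code_point <= 0x10FFFF else "\ufffd"
    return TOKEN.sub(to_char, text)

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President<U+2014>Elect"
print(decode_tokens(string))
```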

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	CrYbAbY
Solution 2
Solution 3
Solution 4
