How do I apply a normalize function to a pandas string Series?

I would like to apply the following function to a dataframe series:

unicodedata.normalize('NFKD', c.lower().decode('utf-8')).encode('ascii','ignore')

I (sort of) understand how I can do stuff like db.cname.str.lower(), but I'm not able to generalize to any other function after the string accessor.

How do I apply the normalize function to all members of the series?



Solution 1:[1]

Assuming c is the name of your string column, map applies a function elementwise (and of course you don't have to chain it all together like this):

import unicodedata

df[c] = (df[c].str.lower()
              .str.decode('utf-8')
              .map(lambda x: unicodedata.normalize('NFKD', x))
              .str.encode('ascii', 'ignore'))
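The same idea can also be written as one plain function passed to map, which generalizes to any function you want to apply per element. This is only a minimal sketch, assuming Python 3 where the column already holds str values (so no decode step is needed). The helper name to_ascii and the sample data are made up for illustration:

```python
import unicodedata

import pandas as pd


def to_ascii(text):
    """Lower-case, decompose accented characters (NFKD), drop non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text.lower())
    # encode() returns bytes; decode back so the column stays text
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Hypothetical sample data
df = pd.DataFrame({'cname': ['Café', 'NAÏVE', 'Zürich']})

# map calls to_ascii once per element of the Series
df['cname_ascii'] = df['cname'].map(to_ascii)
print(df['cname_ascii'].tolist())  # ['cafe', 'naive', 'zurich']
```

If the column may contain missing values, df['cname'].map(to_ascii, na_action='ignore') skips them instead of raising inside the function.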

Solution 2:[2]

pandas has a built-in string method for normalization (available, I believe, at least since version 1.0), so this part can be carried out with:

df[a].str.normalize('NFKD')
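To see what this step does on its own, here is a small sketch, assuming a Series that already contains str values. The sample characters are chosen only for illustration: NFKD splits a precomposed accented letter into a base letter plus a combining mark, and expands compatibility characters such as ligatures.

```python
import pandas as pd

# Hypothetical sample: a precomposed e-acute and the "fi" ligature
s = pd.Series(['\u00e9', '\ufb01'])
normalized = s.str.normalize('NFKD')

print(s.str.len().tolist())           # [1, 1]
print(normalized.str.len().tolist())  # [2, 2]

# e-acute becomes 'e' + combining acute accent; the ligature becomes 'fi'
print(normalized.tolist() == ['e\u0301', 'fi'])  # True
```

The decomposition is what makes the later encode('ascii', 'ignore') step keep the base letter and drop only the accent.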

If I understood your function correctly, lower() and decode() should be swapped: decode() takes bytes as input, and lower-casing should be done on the decoded characters rather than on the raw bytes. So, assuming c is a bytes object and df[a] is a Series of bytes, it could be done with:

df[a].str.decode('utf-8').str.lower().str.normalize('NFKD').str.encode('ascii', 'ignore')
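As a runnable sketch of that chain, assuming the column really holds UTF-8 encoded bytes (the column name and sample values below are invented for the example):

```python
import pandas as pd

# Hypothetical sample data: a column of UTF-8 encoded bytes
a = 'cname'
df = pd.DataFrame({a: ['Café'.encode('utf-8'), 'ZÜRICH'.encode('utf-8')]})

result = (df[a].str.decode('utf-8')           # bytes -> str
               .str.lower()                   # lower-case the characters
               .str.normalize('NFKD')         # split accents from base letters
               .str.encode('ascii', 'ignore'))  # str -> bytes, dropping non-ASCII

print(result.tolist())  # [b'cafe', b'zurich']

# Optional: convert back to text if bytes are not wanted downstream
as_text = result.str.decode('ascii')
print(as_text.tolist())  # ['cafe', 'zurich']
```

Note that the result of str.encode is again a Series of bytes; the last step is only needed if you want ordinary strings.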

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   chrisb
Solution 2