Python 3 default (implicit) Unicode UTF-8 normalization: how to?
Yes, this is far from the first question about Unicode normalization in Python.
As many folks know, there are Unicode letters that look the "same" but are "not the same" (their string lengths even differ!):
In [1]: s='å'
In [2]: import unicodedata
In [3]: q=unicodedata.normalize('NFD', s)
In [4]: q
Out[4]: 'å'
In [5]: s
Out[5]: 'å'
In [6]: s == q
Out[6]: False
In [7]: len(s), len(q)
Out[7]: (1, 2)
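For reference, the two spellings compare equal again once both sides are brought to the same normalization form — a minimal sketch of the same session:

```python
import unicodedata

s = '\u00e5'                           # 'å' precomposed (NFC), one code point
q = unicodedata.normalize('NFD', s)    # decomposed: 'a' + U+030A combining ring
assert s != q and len(q) == 2

# Normalizing both sides to the same form (here NFC) restores equality:
assert unicodedata.normalize('NFC', q) == s
```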
So, the question is: is there a way to set a DEFAULT normalization (say, "NFC") for each and every .decode("utf-8") call?
I mean, can I add a hook (or something similar) to normalize all input?
I've faced this problem handling input from different browsers in a backend API… Sometimes the input arrives in NFD (for unknown reasons) and makes searches fail.
PS. I do not want to "fix" each and every input routine to re-normalize what it gets (I have already done that, and I dislike it).
PPS. I'd like to have something like:
import builtins
import unicodedata

class mystr(str):
    '''
    str(object='') -> str
    str(bytes_or_buffer[, encoding[, errors]]) -> str
    '''
    def __new__(cls, *av, utf_normalize='NFC', **kw):
        # str is immutable, so normalization must happen in __new__;
        # reassigning self in __init__ would have no effect
        s = str(*av, **kw)
        return super().__new__(cls, unicodedata.normalize(utf_normalize, s))
    ...

builtins.str = mystr  # "is this the real life, is this just fantasy?"©…
Solution 1:
No, there is no built-in explicit functionality to do this for you. However, Python lets you easily replace or wrap built-in classes, so it's not hard to build your own string type with the desired behavior.
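For instance, one way to wrap the decoding step itself is to register a custom codec that runs NFC normalization after the standard UTF-8 decode. This is a sketch under assumptions: the codec name `utf-8-nfc` is invented here, and every call site still has to opt in by naming it, so it is not a true process-wide default:

```python
import codecs
import unicodedata

def _nfc_utf8_search(name):
    # codecs normalizes lookup names: 'utf-8-nfc' arrives as 'utf_8_nfc'
    if name != 'utf_8_nfc':
        return None
    utf8 = codecs.lookup('utf-8')

    def decode(data, errors='strict'):
        text, consumed = utf8.decode(data, errors)
        return unicodedata.normalize('NFC', text), consumed

    return codecs.CodecInfo(name='utf-8-nfc', encode=utf8.encode, decode=decode)

codecs.register(_nfc_utf8_search)

# b'a\xcc\x8a' is 'a' + COMBINING RING ABOVE in UTF-8 (NFD form);
# decoding via the custom codec yields the single precomposed 'å'
text = b'a\xcc\x8a'.decode('utf-8-nfc')
```

The encode side is passed through unchanged here; you could symmetrically normalize on encode as well if your application needs round-trip stability.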
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | tripleee |
