Django "surrogates not allowed" error on model.save() call when text includes emoji character
We are building a system that stores text in a PostgreSQL database via Django. The data is then extracted via PGSync to ElasticSearch.
We have run into the following issue in a test case.
Error Message:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 159-160: surrogates not allowed
We identified the character that causes the issue: it is an emoji.
The text is a mixture of Greek characters, English characters and, it seems, emojis. The Greek is not shown as Greek, but in the \u escape form.
Relevant text that causes the issue:
\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag
\ud83d\ude9b translates to this emoji: 🚛
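To make the mapping from the two \u escapes to a single emoji concrete, here is a minimal sketch of the standard UTF-16 surrogate-pair arithmetic. The function name `combine_surrogates` is made up for illustration and is not part of Django or the standard library.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point encoded by a UTF-16 high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE9B)
print(hex(code_point))  # 0x1f69b
```

So the pair \ud83d\ude9b stands for the single code point U+1F69B, which is the emoji above.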
As it says here: https://python-list.python.narkive.com/aKjK4Jje/encoding-of-surrogate-code-points-to-utf-8
The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters.
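The quoted rule can be checked directly in Python. This is only a small illustration, assuming Python 3: a string holding the two surrogate code points cannot be encoded as UTF-8, while the same emoji written as one code point encodes without problems.

```python
lone_surrogates = "\ud83d\ude9b"  # two separate surrogate code points
real_emoji = "\U0001F69B"         # the same emoji as a single code point

try:
    lone_surrogates.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    # "surrogates not allowed": UTF-8 has no encoding for U+D800..U+DFFF
    encodable = False

emoji_bytes = real_emoji.encode("utf-8")

print(encodable)                               # False
print(emoji_bytes)                             # b'\xf0\x9f\x9a\x9b'
print(len(lone_surrogates), len(real_emoji))   # 2 1
```

This suggests the error is raised by Python's UTF-8 encoder when the string is encoded, not by the emoji as such.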
PostgreSQL has the following encodings:
- Default:UTF8
- Collate:en_US.utf8
- Ctype:en_US.utf8
Is this a UTF-8 issue, or is it specific to emoji? Is this a Django or a PostgreSQL issue?
Solution 1:[1]
Reproduce the issue:
x='\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag'
print(x)
Traceback (most recent call last):
  File "", line 1, in
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 21-22: surrogates not allowed
Solution: apply raw_unicode_escape and unicode_escape codecs (see Python Specific Encodings) as follows:
y = x.encode('raw_unicode_escape').decode('unicode_escape').encode('utf-16_BE','surrogatepass').decode('utf-16_BE')
print(y)
με Some English Text 🚛
#SomeHashTag
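If the string already contains real surrogate code points (as in the reproduction above) rather than literal backslash sequences, the UTF-16 round trip alone may be enough. The following is a hedged sketch, not the answer's exact method: `merge_surrogates` is a hypothetical helper name, and calling it before `model.save()` is an assumption about where the cleanup would go.

```python
def merge_surrogates(text: str) -> str:
    """Combine UTF-16 surrogate pairs in text into real code points.

    'surrogatepass' lets the encoder write the surrogates as UTF-16 code
    units; decoding that UTF-16 data then joins each pair into one character.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


x = '\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag'
y = merge_surrogates(x)

# After merging, the text can be encoded as UTF-8 without the error.
utf8_bytes = y.encode("utf-8")
print(len(x), len(y))  # 36 35
```

Note that an unpaired surrogate would still make the final decode fail, so input that may contain lone surrogates needs separate handling.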
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | JosefZ |
