Django "surrogates not allowed" error on model.save() call when text includes emoji character

We are building a system that stores text in a PostgreSQL database via Django. The data is then extracted to Elasticsearch via PGSync.

We have run into the following issue in a test case:

Error Message:

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 159-160: surrogates not allowed

We identified the character that causes the issue: it is an emoji.

The text is a mixture of Greek characters, English characters and, it seems, emojis. The Greek is not shown as Greek, but in \u escape form instead.

Relevant Text that causes the issue:

\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag

\ud83d\ude9b translates to this emoji: 🚛
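
For reference, the mapping from a UTF-16 surrogate pair to a single code point can be checked by hand. The following is a minimal sketch of that arithmetic; the function name combine_surrogates is made up for illustration and is not part of any library.

```python
import unicodedata


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE9B)
print(hex(code_point))                    # 0x1f69b
print(unicodedata.name(chr(code_point)))  # ARTICULATED LORRY
```

So the two escapes together stand for the single character U+1F69B, which is what the emoji should have been stored as.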

As it says here: https://python-list.python.narkive.com/aKjK4Jje/encoding-of-surrogate-code-points-to-utf-8

The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters.
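
The rule quoted above can be observed in plain Python, with no Django or PostgreSQL involved. This is a small sketch, assuming the text reaches Python as two separate surrogate code points rather than as one emoji character; the variable names are illustrative only.

```python
lone = "\ud83d\ude9b"   # two separate surrogate code points
real = "\U0001F69B"     # the actual emoji as a single code point

try:
    lone.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError as exc:
    encoded_ok = False
    print(exc.reason)            # surrogates not allowed

utf8_bytes = real.encode("utf-8")
print(len(lone), len(real))      # 2 1
print(utf8_bytes)                # b'\xf0\x9f\x9a\x9b'
```

The same emoji encodes without trouble once it is a single code point, which suggests the problem lies in how the string was built rather than in emoji support as such.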

PostgreSQL has the following encodings:

Default: UTF8
Collate: en_US.utf8
Ctype: en_US.utf8

Is this a UTF-8 issue, or is it specific to emoji? Is this a Django or a PostgreSQL issue?

Solution 1:^[1]

Reproduce the issue:

x='\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag'
print(x)

Traceback (most recent call last):
  File "", line 1, in
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 21-22: surrogates not allowed

Solution: apply raw_unicode_escape and unicode_escape codecs (see Python Specific Encodings) as follows:

y = x.encode('raw_unicode_escape').decode('unicode_escape').encode('utf-16_BE','surrogatepass').decode('utf-16_BE')
print(y)

με Some English Text 🚛
#SomeHashTag
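
If the string already holds real surrogate code points (as x does above) rather than literal backslash-u text, the last two steps of the chain appear to be enough on their own. Below is a minimal sketch of that step wrapped in a helper that could be applied to the text before it is assigned to the model field; the name merge_surrogate_pairs is hypothetical.

```python
def merge_surrogate_pairs(text: str) -> str:
    """Turn surrogate pairs held as separate code points into real characters."""
    # 'surrogatepass' lets the surrogates through as UTF-16 code units;
    # decoding those units then joins each pair into one code point.
    return text.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


broken = "\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag"
fixed = merge_surrogate_pairs(broken)

utf8_bytes = fixed.encode("utf-8")   # no longer raises
print(len(broken), len(fixed))       # 36 35
```

Text that is already well formed passes through unchanged, so the helper can be applied unconditionally. An unpaired surrogate would still fail at the decode step, so that case needs separate handling.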

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	JosefZ
