Can I represent arbitrary binary data in WTF-8 (or any extension of UTF-8)?

A long time ago, there was a two-byte Unicode encoding, UCS-2, but it was eventually determined that two bytes are sometimes not enough. In order to cram more code points into 16-bit units, surrogate pairs were introduced in UTF-16. Since Windows started out with UCS-2, it doesn't enforce the rules around surrogate pairs in some places, most notably file systems.
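For illustration, the surrogate-pair arithmetic is simple: a code point above U+FFFF is split into a high and a low surrogate (a sketch in Python; the constants come from the UTF-16 definition):

```python
# UTF-16 surrogate-pair encoding of a supplementary code point.
cp = 0x1F600                 # a code point above U+FFFF
v = cp - 0x10000             # 20 bits to distribute across two units
hi = 0xD800 + (v >> 10)      # high (lead) surrogate: upper 10 bits
lo = 0xDC00 + (v & 0x3FF)    # low (trail) surrogate: lower 10 bits
assert (hi, lo) == (0xD83D, 0xDE00)
```

An "invalid" UTF-16 sequence is then simply one of these units appearing without its partner.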

Programs that want to use UTF-8 internally now have a problem dealing with these invalid UTF-16 sequences. For this, WTF-8 was developed. It is mostly relaxed UTF-8, but it is able to round-trip unpaired surrogates.
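As a concrete sketch of that round-tripping (assuming Python, whose surrogatepass error handler happens to produce the same byte sequence WTF-8 uses for a lone surrogate):

```python
# A lone (unpaired) high surrogate: invalid in strict UTF-8 and
# strict UTF-16, but representable in WTF-8 as the "obvious"
# three-byte sequence the UTF-8 algorithm would produce for U+D800.
s = "\ud800"
b = s.encode("utf-8", "surrogatepass")
assert b == b"\xed\xa0\x80"                      # WTF-8 bytes for U+D800
assert b.decode("utf-8", "surrogatepass") == s   # round-trips losslessly
```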

Now it seems like it should be possible to relax UTF-8 a bit further and allow it to represent arbitrary binary data in a round-trip-safe way. The strings I have in mind are originally 99.9% either valid UTF-8 or almost-valid UTF-16 of the kind WTF-8 can stomach, but occasionally there will be invalid byte sequences thrown in.

WTF-8 defines generalized UTF-8 as:

an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII).
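That "same underlying algorithm" can be sketched as follows. Note this is my own extrapolation: the sketch extends the byte layout to arbitrary values up to 31 bits, as in the original 1993 FSS-UTF design, whereas the WTF-8 spec's generalized UTF-8 still only covers code points up to U+10FFFF:

```python
# Sketch (assumption): the UTF-8 byte layout, extended in the obvious
# way to values up to 31 bits (up to 6 bytes), as in the original
# FSS-UTF design. Modern UTF-8 caps code points at U+10FFFF / 4 bytes.
def encode_generalized(cp: int) -> bytes:
    if cp < 0x80:
        return bytes([cp])           # ASCII: single byte, as-is
    tail = []
    n = 1
    while True:
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
        n += 1
        # with n bytes total, the lead byte holds 7 - n payload bits
        if cp < (1 << (7 - n)):
            break
    lead = ((0xFF << (8 - n)) & 0xFF) | cp   # e.g. 110xxxxx, 1110xxxx, ...
    return bytes([lead] + tail[::-1])

assert encode_generalized(0x20AC) == "\u20ac".encode("utf-8")  # matches UTF-8
assert encode_generalized(0xD800) == b"\xed\xa0\x80"           # lone surrogate
```

The point of the sketch is that the algorithm itself has no inherent upper bound at U+10FFFF; the bound is imposed by the definitions, which is exactly what my question is about.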

Would generalized UTF-8 allow me to store arbitrary 32-bit sequences, and thus arbitrary data? Or is there another way, such as a Unicode escape character? Things I don't want to do are Base64 encoding or percent-encoding, since I want to leave valid Unicode strings unchanged.

Standard disclaimer: I have encountered this problem a couple of times before, but right now it is an academic question, and I'm just interested in a straight answer on how to do this. There is no XY problem :-)



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow