Why do Unicode U+202E and U+202C cause output text to have a different result?
In Java:
If I print "123\u202e987\u202c456abc", then the result is displayed as 123456789abc.
If I print "123\u202e987\u202cxyzabc", then the result is displayed as 123789xyzabc.
You can see that when "456" is changed to "xyz" in the string to be printed, the output sequences are different.
How does this work?
Solution 1:[1]
TLDR: The effect you are seeing arises because digits and alphabetic characters are treated differently by the Unicode algorithm that determines the rendering of text containing format control characters.
For the texts you are displaying:
- \u202e is the RIGHT-TO-LEFT OVERRIDE (RLO) character.
- \u202c is the POP DIRECTIONAL FORMATTING (PDF) character.
- Both are formatting control characters in Unicode, and their sole effect is to impact the appearance of output text.
- In your examples the RLO character specifies that the text which follows is to be displayed from right to left, and the PDF character cancels ("pops") the effect of the RLO.
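Because both control characters are invisible when printed, it can help to dump each code point of the string with its Unicode name and confirm that they are present and that the stored order is unchanged. The following is only an illustrative sketch: the class name CodePointDump and the helper describe are made up for this example, while String.codePoints() and Character.getName(int) are standard JDK methods (Java 8 or later is assumed).

```java
final class CodePointDump {
    /** Returns one line per code point: hex value followed by the Unicode character name. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp ->
            sb.append(String.format("U+%04X %s%n", cp, Character.getName(cp))));
        return sb.toString();
    }

    public static void main(String[] args) {
        // The string from the question; RLO and PDF show up as separate entries.
        System.out.print(describe("123\u202e987\u202cxyzabc"));
    }
}
```

The dump lists the digits 9, 8, 7 in that stored order between the RLO and PDF entries, which shows that only the display is reversed, not the data.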
That explains why the text 123\u202e987\u202cxyzabc in your example is rendered as 123789xyzabc. The RLO (\u202e) causes the text that follows to be rendered in right to left order (so 987 is displayed as 789), and the PDF (\u202c) terminates reversal for the subsequent text.
But it does not explain why 123\u202e987\u202c456abc is rendered as 123456789abc. By that argument, the expected output should be 123789456abc instead.
The algorithm used to determine the output in scenarios like this is very complex, but one factor is the directionality of the characters being rendered. Alphabetic characters have strong directionality, but numbers (i.e. digit characters) have weak directionality. For full details see the Unicode document Unicode® Standard Annex #9, Unicode Bidirectional Algorithm, and especially section 3.3.4, Resolving Weak Types.
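One way to explore this in Java is to let java.text.Bidi resolve the embedding level of each character and then reorder the visible characters with Bidi.reorderVisually. The sketch below is an approximation, not what a terminal or text component actually runs: the class BidiVisualOrder and the helper visualOrder are hypothetical names, the base direction is assumed to be left to right, and mirroring, shaping and line breaking are ignored. If the levels are resolved as the annex describes, it should yield 123456789abc and 123789xyzabc for the two strings in the question.

```java
import java.text.Bidi;

final class BidiVisualOrder {
    /** Returns the visible characters of {@code logical} in display order (illustrative helper). */
    static String visualOrder(String logical) {
        // Resolve embedding levels, assuming a left-to-right base direction.
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_LEFT_TO_RIGHT);

        // Keep the level of every visible character; invisible format
        // characters (general category Cf, such as RLO and PDF) are skipped.
        byte[] levels = new byte[logical.length()];
        Character[] chars = new Character[logical.length()];
        int n = 0;
        for (int i = 0; i < logical.length(); i++) {
            char c = logical.charAt(i);
            if (Character.getType(c) == Character.FORMAT) {
                continue;
            }
            levels[n] = (byte) bidi.getLevelAt(i);
            chars[n] = c;
            n++;
        }
        if (n == 0) {
            return "";
        }

        // Reorder the first n characters in place according to their levels.
        Bidi.reorderVisually(levels, 0, chars, 0, n);
        StringBuilder out = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            out.append(chars[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(visualOrder("123\u202e987\u202c456abc"));
        System.out.println(visualOrder("123\u202e987\u202cxyzabc"));
    }
}
```

Printing bidi.getLevelAt(i) for each index is a quick way to see the difference: the digits that follow the PDF resolve to a higher level than the letters xyz do, because weak characters take their direction from the surrounding context.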
That document provides an example similar to yours, with text containing a RIGHT-TO-LEFT EMBEDDING (RLE) character (rather than an RLO), later followed by a PDF and some trailing text containing digits:
Memory: it is called "[RLE]AN INTRODUCTION TO java[PDF]" - $19.95 in hardcover.
Display: it is called "$19.95 - "java OT NOITCUDORTNI NA in hardcover.
Note that in their example it wasn't just the digits that were moved. The dollar sign and the period were as well, because all six of the characters in the text $19.95 have weak directionality.
Notes:
- You can get the directionality category of any Unicode character in Java using Character.getDirectionality(int codePoint)
- The Unicode document linked above is heavy reading. Basic introductions to bidirectional text include W3C's Unicode Bidirectional Algorithm basics and Unicode's Writing Direction and Bidirectional Text FAQ.
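As a small illustration of the first note, the sketch below classifies a few characters with Character.getDirectionality. The class DirectionalityDemo and the helper isStrong are hypothetical names for this example; the DIRECTIONALITY_* constants are part of java.lang.Character. Treating only L, R and AL as strong follows the terminology of the Unicode annex.

```java
final class DirectionalityDemo {
    /** True if the code point has one of the strong types L, R or AL (illustrative helper). */
    static boolean isStrong(int codePoint) {
        byte d = Character.getDirectionality(codePoint);
        return d == Character.DIRECTIONALITY_LEFT_TO_RIGHT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    public static void main(String[] args) {
        // Letters report strong=true; digits, '$' and '.' report strong=false.
        "abc456$19.95".codePoints().forEach(cp ->
            System.out.printf("%c strong=%b category=%d%n",
                cp, isStrong(cp), Character.getDirectionality(cp)));
    }
}
```

The numeric category printed for each character can be compared against the DIRECTIONALITY_* constants, for example DIRECTIONALITY_EUROPEAN_NUMBER for the digits.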
Solution 2:[2]
The Unicode formatting characters cause this, because both of them affect how the text around them is displayed.
- \u202e reverses text (RIGHT-TO-LEFT override)
- \u202c: POP DIRECTIONAL FORMATTING
In your question, 123\u202e987\u202cxyzabc is displayed as 123789xyzabc. The \u202e causes the 987 to be output reversed, as 789, and the \u202c stops the RIGHT-TO-LEFT override.
In the other case (123\u202e987\u202c456abc), the \u202c is followed by digits, which have weak directionality. As a result, those digits are still laid out as part of the preceding right-to-left text, so 456 is displayed before the reversed 789.
EDIT: @skomisa's answer is better.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | skomisa |
| Solution 2 |
