An example of how to use RGI_Emoji in a Regex
I'm currently looking at regexes and emojis, and I'd like to use Unicode property escapes to simplify the task.
https://unicode.org/reports/tr18/#Full_Properties lists a number of emoji properties, such as Emoji and Emoji_Presentation.
Creating a regex using these patterns works:
const re = /\p{Emoji}/gu
The same page also lists RGI_Emoji, which is
The set of all emoji (characters and sequences) covered by ED-20, ED-21, ED-22, ED-23, ED-24, and ED-25.
that is, basic emojis, modifier sequences, etc., which seems to cover all the use cases that I'm looking at.
However, creating a regex using this:
const re = /\p{RGI_Emoji}/gu
Gives a SyntaxError:
Uncaught SyntaxError: invalid property name in regular expression
The Unicode page does mention that
Properties marked with * are properties of strings, not just single code points.
which RGI_Emoji is marked as. My knowledge of Unicode isn't amazing, so I'm not sure of the proper way to use it.
Is it possible to use RGI_Emoji in a regex, and if so, what's the correct way to use it?
Solution 1:[1]
The emoji properties were added to UTS #18 only relatively recently (mid 2020). This was a significant change to Unicode's property model, because it formally defined properties of strings for the first time. RGI_Emoji is a binary-valued property of strings of characters. A potential issue with string properties in a regex is that the set corresponding to a string property can contain a vast number of strings. To avoid problems in existing implementations, UTS #18 allows the syntax \m{Property_Name} for string properties. See https://www.unicode.org/reports/tr18/#Resolving_Character_Ranges_with_Strings for more information.
It's possible that the implementation you're using has not been fully updated for Rev. 21 of UTS #18, with support for all new properties, or that it requires you to use the \m syntax for string properties.
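One way to find out what a particular engine accepts is to try compiling the pattern and catch the resulting SyntaxError. The sketch below does this for JavaScript. supportsProperty is a hypothetical helper name, and the v flag probe rests on the assumption that some newer engines accept properties of strings only under that flag, so treat its result as something to verify on your own runtime.

```javascript
// Hypothetical helper: reports whether this engine can compile \p{<name>}
// with the given flags. An unsupported property name or an unknown flag
// makes the RegExp constructor throw a SyntaxError, which is caught here.
function supportsProperty(name, flags) {
  try {
    new RegExp(`\\p{${name}}`, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

console.log('Emoji with u:    ', supportsProperty('Emoji', 'u'));
console.log('RGI_Emoji with u:', supportsProperty('RGI_Emoji', 'u'));
// Assumption: some newer engines accept string properties under the v flag.
console.log('RGI_Emoji with v:', supportsProperty('RGI_Emoji', 'v'));
```

Probing at runtime avoids hard-coding assumptions about which revision of UTS #18 an engine follows.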
The online Unicode UnicodeSet utility does support enumerating string results of a regex using the RGI_Emoji property:
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BRGI_Emoji%7D&g=&i=
Solution 2:[2]
RGI_Emoji is not available in JavaScript yet.
The note at the top of the Full Properties table says:
Properties marked with * are properties of strings, not just single code points.
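To make that note concrete, here is a small sketch showing that a single visible emoji can consist of several code points, so a property of single code points such as Emoji cannot match it as a whole. The family emoji is written with escapes so that the individual code points are visible; the variable names are illustrative only.

```javascript
// Family emoji written with escapes: MAN, ZWJ, WOMAN, ZWJ, GIRL.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

// Spreading a string iterates by code point, not by UTF-16 code unit.
const codePoints = [...family].map(cp => cp.codePointAt(0).toString(16).toUpperCase());
console.log(codePoints.join(' ')); // 1F468 200D 1F469 200D 1F467

// \p{Emoji} matches exactly one code point at a time.
const singleEmoji = /^\p{Emoji}$/u;
console.log(singleEmoji.test('\u{1F468}')); // true
console.log(singleEmoji.test(family));      // false
```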
Support for the following sequence properties is proposed in proposal-regexp-unicode-sequence-properties. The proposal is at stage 2, i.e. it is not yet part of the ECMAScript specification and hence not available.
RGI_Emoji
Basic_Emoji
Emoji_Keycap_Sequence
RGI_Emoji_Modifier_Sequence
RGI_Emoji_Flag_Sequence
RGI_Emoji_Tag_Sequence
RGI_Emoji_ZWJ_Sequence
To confirm, check the \p{UnicodeBinaryPropertyName} values available in the latest ECMAScript specification. Only the following emoji-related properties of characters are available:
Emoji
Emoji_Component (alias EComp)
Emoji_Modifier (alias EMod)
Emoji_Modifier_Base (alias EBase)
Emoji_Presentation
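These properties overlap in ways that are easy to miss, so it can help to check which of them a given character has. The following is a minimal sketch; emojiProps is a hypothetical helper, not part of any library.

```javascript
// Code point properties from the list above, under their long names.
const props = [
  'Emoji',
  'Emoji_Presentation',
  'Emoji_Modifier',
  'Emoji_Modifier_Base',
  'Emoji_Component',
];

// Hypothetical helper: returns the listed properties that the single
// code point in `ch` has.
function emojiProps(ch) {
  return props.filter(name => new RegExp(`^\\p{${name}}$`, 'u').test(ch));
}

console.log(emojiProps('7').join(', '));         // Emoji, Emoji_Component
console.log(emojiProps('\u{1F600}').join(', ')); // Emoji, Emoji_Presentation
console.log(emojiProps('\u{1F3FD}').join(', ')); // medium skin tone modifier
console.log(emojiProps('a').length);             // 0
```

Note that ASCII digits have the Emoji property, which is why /\p{Emoji}/u on its own matches more than most people expect.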
You'll have to build a regular expression from Unicode ranges covering the ED-20, ED-21, ED-22, ED-23, ED-24, and ED-25 sets, as suggested by @JosefZ in a comment.
This discussion may help: JavaScript regular expression for Unicode emoji
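As a rough illustration of that approach, the sketch below combines code point properties instead of literal ranges to approximate emoji sequences. It is an approximation under stated assumptions, not an implementation of RGI_Emoji: it also accepts sequences that are not recommended for general interchange, and it does not handle tag sequences such as subdivision flags. The names element and approxEmoji are made up for this example, and it assumes the engine supports the Regional_Indicator property escape.

```javascript
// A single emoji element, built only from code point properties:
//  - a modifier base followed by a skin-tone modifier, or
//  - a character with default emoji presentation, or
//  - any Emoji character forced to emoji presentation by U+FE0F.
const element = String.raw`(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}|\p{Emoji_Presentation}|\p{Emoji}\uFE0F)`;

// Order matters: flags and keycaps are tried before the generic element,
// because regional indicators and digits would otherwise match on their own.
const approxEmoji = new RegExp(
  [
    String.raw`\p{Regional_Indicator}{2}`, // flag: two regional indicators
    String.raw`[#*0-9]\uFE0F?\u20E3`,      // keycap sequence
    `${element}(?:\\u200D${element})*`,    // one element, or a ZWJ sequence
  ].join('|'),
  'gu'
);

// Heart, flag of Japan, thumbs up with skin tone, keycap 1, family, then "42".
const sample = 'a \u2764\uFE0F \u{1F1EF}\u{1F1F5} \u{1F44D}\u{1F3FD} ' +
  '1\uFE0F\u20E3 \u{1F468}\u200D\u{1F469}\u200D\u{1F467} 42';
console.log(sample.match(approxEmoji).length); // 5
```

Plain digits such as "42" are not matched, because a digit only counts here when it is followed by U+FE0F or by the keycap mark U+20E3.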
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Peter Constable |
| Solution 2 |
