An example of how to use RGI_Emoji in a Regex

I'm currently looking at regexes and emoji, and I'd like to use Unicode property escapes to simplify the task.

https://unicode.org/reports/tr18/#Full_Properties lists a number of emoji properties, such as Emoji and Emoji_Presentation.

Creating a regex using these patterns works:

const re = /\p{Emoji}/gu

The same page also lists RGI_Emoji, which is

The set of all emoji (characters and sequences) covered by ED-20, ED-21, ED-22, ED-23, ED-24, and ED-25.

i.e. basic emoji, modifier sequences, etc., which seems to cover all the use cases I'm looking at.

However, creating a regex using this:

const re = /\p{RGI_Emoji}/gu

Gives a SyntaxError:

Uncaught SyntaxError: invalid property name in regular expression

The Unicode page does mention that

Properties marked with * are properties of strings, not just single code points.

and RGI_Emoji is marked this way. My knowledge of Unicode isn't amazing, so I'm not sure of the proper way to use this.

Is it possible to use RGI_Emoji in a regex, and if so, what's the correct way to use it?



Solution 1:[1]

The emoji properties were added to UTS #18 only relatively recently (mid-2020). This was a significant change to Unicode's property model, because it formally defined properties of strings for the first time. RGI_Emoji is a binary property of strings of characters. A potential issue with using string properties in a regex is that the set corresponding to a string property may contain a vast number of strings. To avoid problems in existing implementations, UTS #18 allows the syntax \m{Property_Name} for string properties. See https://www.unicode.org/reports/tr18/#Resolving_Character_Ranges_with_Strings for more information.

It's possible that the implementation you're using has not been fully updated for Rev. 21 of UTS #18, with support for all new properties, or that it requires you to use the \m syntax for string properties.
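One way to check what a given implementation accepts is to try compiling each property escape and catch the SyntaxError. The following is a minimal sketch for JavaScript; the helper name supportsProperty and the list of names probed are illustrative choices, not part of any standard API.

```javascript
// Hypothetical helper: reports whether this engine can compile \p{name}
// with the given flags. Compilation goes through the RegExp constructor so
// that an unsupported name raises a catchable SyntaxError at run time
// instead of failing when the whole script is parsed.
function supportsProperty(name, flags = "u") {
  try {
    new RegExp(`\\p{${name}}`, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected, so let it propagate
  }
}

// Probe the properties mentioned in the question under the `u` flag.
const names = ["Emoji", "Emoji_Presentation", "RGI_Emoji"];
const report = Object.fromEntries(
  names.map((name) => [name, supportsProperty(name)])
);

console.log(report);
```

With the `u` flag, the code point properties should be reported as supported and RGI_Emoji as unsupported, which matches the error shown in the question.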

The online Unicode UnicodeSet utility does support enumerating string results of a regex using the RGI_Emoji property:

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BRGI_Emoji%7D&g=&i=

Solution 2:[2]

RGI_Emoji is not available in JavaScript yet.

Above the Full Properties table, the page notes that

Properties marked with * are properties of strings, not just single code points.

Support for the following sequence properties is proposed in proposal-regexp-unicode-sequence-properties. The proposal is at stage 2, i.e. it is not part of the ECMAScript specification yet, and hence these properties are not available:

RGI_Emoji
Basic_Emoji
Emoji_Keycap_Sequence
RGI_Emoji_Modifier_Sequence
RGI_Emoji_Flag_Sequence
RGI_Emoji_Tag_Sequence
RGI_Emoji_ZWJ_Sequence
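Engines newer than this answer may accept these properties of strings through the RegExp `v` flag (unicodeSets mode). That flag is an assumption here, not something covered by the proposal status described above, so the sketch below guards the compilation and falls back to null when the engine rejects it. The function name compileRgiEmoji and the sample string are made up for illustration.

```javascript
// Tries to compile \p{RGI_Emoji} with the `v` flag (an assumption: only
// engines that implement unicodeSets mode accept it). Returns null if the
// flag or the property name is rejected.
function compileRgiEmoji() {
  try {
    return new RegExp("\\p{RGI_Emoji}", "gv");
  } catch (err) {
    if (err instanceof SyntaxError) return null;
    throw err;
  }
}

const rgiEmoji = compileRgiEmoji();

// Sample text: a thumbs-up with a skin tone modifier, a plain digit,
// and the two regional indicators that form the flag of France.
const sample = "ok \u{1F44D}\u{1F3FD} 1 \u{1F1EB}\u{1F1F7}";

// Each emoji sequence is expected to come back as a single match;
// `found` stays null when the engine lacks support.
const found = rgiEmoji ? sample.match(rgiEmoji) : null;

console.log(found);
```

When the regex cannot be compiled, a pattern built from the code point properties is still needed.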

To confirm, check the available \p{UnicodeBinaryPropertyName} values in the latest ECMAScript specification. Only the following emoji-related properties of characters are available:

Emoji
Emoji_Component
EComp
Emoji_Modifier
EMod
Emoji_Modifier_Base
EBase
Emoji_Presentation

You'll have to build a regular expression from Unicode ranges covering the ED-20, ED-21, ED-22, ED-23, ED-24, and ED-25 sets, as suggested by @JosefZ in a comment.
This discussion may help: "JavaScript regular expression for Unicode emoji".
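As a rough illustration of that workaround, the sketch below combines code point properties that are available under the `u` flag into one alternation. It is an approximation, not an implementation of RGI_Emoji: it does not validate flag or tag sequences against the recommended list, and it does not join ZWJ sequences into a single match. The variable names are illustrative, and Regional_Indicator is an extra property not listed above.

```javascript
// Building blocks, written as pattern source strings.
const keycapSeq = "[#*0-9]\\uFE0F?\\u20E3";          // e.g. "1" + VS16 + keycap
const flagSeq = "\\p{Regional_Indicator}{2}";         // any pair, valid flag or not
const modifierSeq = "\\p{Emoji_Modifier_Base}\\p{Emoji_Modifier}";
const textWithVS16 = "\\p{Emoji}\\uFE0F";             // text-default emoji + VS16
const presentation = "\\p{Emoji_Presentation}";       // emoji-default code points

// Longer sequences come first so the alternation prefers them over
// their single code point prefixes.
const approxEmoji = new RegExp(
  [keycapSeq, flagSeq, modifierSeq, textWithVS16, presentation].join("|"),
  "gu"
);

// Thumbs-up with skin tone, a bare digit, a heart with VS16, a flag.
const text = "a\u{1F44D}\u{1F3FD}b1\u2764\uFE0F \u{1F1EB}\u{1F1F7}";
const matches = text.match(approxEmoji);

console.log(matches); // three matches; the bare digit "1" is skipped
```

Unlike a plain \p{Emoji}, this pattern does not match bare digits, "#", or "*", because those only count when followed by the keycap or the variation selector.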

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Peter Constable
Solution 2