What's a good regex to include accented characters in a simple way?
Right now my regex is something like this:
[a-zA-Z0-9]
but it does not match accented characters, which I want it to. I would also like the characters - ' , to be included.
Solution 1:[1]
Accented Characters: DIY Character Range Subtraction
If your regex engine allows it (and many will), this will work:
(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$
Explanation
- (?i) sets case-insensitive mode
- The ^ anchor asserts that we are at the beginning of the string
- (?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ]) matches one character...
- The lookahead (?![×Þß÷þø]) asserts that the char is not one of those in the brackets
- [-'0-9a-zÀ-ÿ] allows dash, apostrophe, digits, letters, and chars in a wide accented range, from which we need to subtract
- The + matches that one or more times
- The $ anchor asserts that we are at the end of the string
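The pattern above can be tried out with a short JavaScript sketch. It assumes a JavaScript engine, so the inline (?i) modifier is expressed with the i flag instead; the helper name isAllowed and the sample strings are made up for illustration.

```javascript
// Sketch of the lookahead-based "subtraction" pattern from this answer.
// The i flag stands in for the inline (?i) modifier.
const allowed = /^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$/i;

// Hypothetical helper: true when every character is in the allowed set.
function isAllowed(text) {
  return allowed.test(text);
}

console.log(isAllowed("déjà-vu")); // accented letters and the dash are inside the class
console.log(isAllowed("O'Neil"));  // the apostrophe is listed explicitly
console.log(isAllowed("5×3"));     // × is rejected by the lookahead
console.log(isAllowed("straße"));  // ß is rejected by the lookahead
```

The first two calls should print true and the last two false, because the lookahead vetoes × and ß before the wide À-ÿ range gets a chance to accept them.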
Solution 2:[2]
Put the following inside the character class of your expression:
\p{L}\p{M}
In a Unicode-aware regex engine, this will match:
- any letter character (L) from any language
- any mark (M), i.e., a character that is meant to be combined with another (an accent, etc.)
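To illustrate why the mark category matters, here is a small JavaScript sketch. It assumes an engine with Unicode property escapes (the u flag); the variable names are illustrative. It compares a precomposed "é" with the decomposed form, where the accent is a separate combining character.

```javascript
// The same visible text "é", encoded two ways.
const composed = "\u00E9";    // single code point: LATIN SMALL LETTER E WITH ACUTE
const decomposed = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT (a mark)

// The u flag is required for \p{...} property escapes in JavaScript.
const lettersOnly = /^\p{L}+$/u;
const lettersAndMarks = /^[\p{L}\p{M}]+$/u;

console.log(lettersOnly.test(composed));       // true
console.log(lettersOnly.test(decomposed));     // false: U+0301 is a mark, not a letter
console.log(lettersAndMarks.test(decomposed)); // true
```

Without \p{M}, input that arrives in decomposed form would be rejected even though it looks identical on screen.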
Solution 3:[3]
A version without the exclusion rules:
^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$
Explanation
- The ^ anchor asserts that we are at the beginning of the string
- [...] allows dash, apostrophe, letters, and chars in a wide accented range; the ranges À-Ö, Ø-ö and ø-ÿ skip × and ÷. Digits are not included as written, so add 0-9 if you need them
- The + matches that one or more times
- The $ anchor asserts that we are at the end of the string
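A quick way to check this variant is the JavaScript sketch below. The constant name and sample strings are illustrative, not part of the original answer. As written, the class has no 0-9 range, so digits are rejected along with × and ÷.

```javascript
// Sketch of the explicit-range pattern: the three accented ranges
// step over U+00D7 (×) and U+00F7 (÷), so no lookahead is needed.
const latinWord = /^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$/;

console.log(latinWord.test("Zoë"));     // ë falls in the ø-ÿ range
console.log(latinWord.test("Øystein")); // Ø starts the Ø-ö range
console.log(latinWord.test("a×b"));     // × sits in the gap between Ö and Ø
console.log(latinWord.test("a÷b"));     // ÷ sits in the gap between ö and ø
console.log(latinWord.test("abc123"));  // digits are not in the class
```

The first two strings should pass and the last three should fail.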
Solution 4:[4]
Use a POSIX character class (http://www.regular-expressions.info/posixbrackets.html):
[-'[:alpha:]0-9] or [-'[:alnum:]]
The [:alpha:] character class matches whatever is considered "alphabetic characters" in your locale.
Solution 5:[5]
@NightCoder's answer works perfectly:
\p{L}\p{M}
and needs no brittle whitelists. Note that to get it working in JavaScript you need to add the Unicode u flag. It is useful to have a working example in JavaScript:
[..."Crêpes are øh-so déclassée".matchAll( /[-'’\p{L}\p{M}\p{N}]+/giu )]
will return something like...
[
{
"0": "Crêpes",
"index": 0
},
{
"0": "are",
"index": 7
},
{
"0": "øh-so",
"index": 11
},
{
"0": "déclassée",
"index": 17
}
]
Here it is in a playground... https://regex101.com/r/ifgH4H/1/
And also some detail on those regex unicode categories... https://javascript.info/regexp-unicode
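The matchAll example extracts words from a longer text. For validating a whole string, as the original question seems to want, the same class can be anchored. The following is only a sketch: the name isValidInput is hypothetical, and the comma is added because the question asks for - ' , to be allowed.

```javascript
// Hypothetical validator: letters from any script, combining marks,
// digits, plus dash, apostrophe, and comma. The u flag enables \p{...}.
const validInput = /^[-',\p{L}\p{M}\p{N}]+$/u;

function isValidInput(text) {
  return validInput.test(text);
}

console.log(isValidInput("Crêpes"));      // letters only
console.log(isValidInput("d'Alençon,2")); // apostrophe, comma and digit are allowed
console.log(isValidInput("déclassée!"));  // "!" is not in the class
console.log(isValidInput("two words"));   // the space is not in the class
```

The first two inputs should be accepted and the last two rejected; add a space or other punctuation to the class if your input needs it.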
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | just.jules |
| Solution 4 | Brian Stephens |
| Solution 5 | |
