When do I need the u modifier in PHP regex?

I know that PHP's PCRE functions treat strings as byte sequences, so many sites suggest using the /u modifier to handle both input and regex as UTF-8.

But do I really always need it? My tests show that this flag makes no difference when I don't use escape sequences, the dot, or anything similar.

For example

preg_match('/^[\da-f]{40}$/', $string); to check whether a string has the format of a SHA-1 hash

preg_replace('/[^a-zA-Z0-9]/', $spacer, $string); to replace every character that is not an ASCII letter or digit

preg_replace('/^\+\((.*)\)$/', '\1', $string); to get the inner content of +(XYZ)

These regexes contain only single-byte ASCII symbols, so they should work on any input regardless of encoding, shouldn't they? Note that the third regex uses the dot operator, but since I only cut off some ASCII characters at the beginning and end of the string, this should also work on UTF-8, correct?

Can anyone tell me if I'm overlooking something?



Solution 1:[1]

The Unicode modifier u allows proper detection of accented characters, which are always multibyte in UTF-8.

preg_match('/([\w ]{2,})/', 'baz báz báž', $match); 
// $match[0] = "baz b" ... wrong, accented/multibyte chars silently ignored

preg_match('/([\w ]{2,})/u', 'baz báz báž', $match); 
// $match[0] = "baz báz báž" ... correct

Also use it for safe detection of whitespace:

preg_replace('/\s+/u', ' ', $txt); // works reliably, e.g. with EOLs (line endings)
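To see what the modifier changes for \s, here is a minimal sketch using a no-break space (U+00A0) as the test character. The sample string and variable names are made up for illustration, and the non-/u result can depend on the active locale, so treat this as a demonstration rather than a guarantee.

```php
<?php
// "a", NO-BREAK SPACE (U+00A0, bytes C2 A0 in UTF-8), "b", newline, "c".
// Byte escapes are used so the example does not depend on the file encoding.
$txt = "a\xC2\xA0b\nc";

// Without /u the subject is matched byte by byte. In the default "C" locale
// the bytes C2 and A0 are usually not treated as whitespace, so typically
// only the newline is replaced.
$byteMode = preg_replace('/\s+/', ' ', $txt);

// With /u the subject is decoded as UTF-8 and \s also covers Unicode
// whitespace such as U+00A0.
$utf8Mode = preg_replace('/\s+/u', ' ', $txt);

echo bin2hex($byteMode), "\n";
echo bin2hex($utf8Mode), "\n"; // 6120622063, i.e. "a b c"
```

Comparing the two hex dumps makes the byte-level difference visible even on terminals that render a no-break space like an ordinary space.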

Solution 2:[2]

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
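The "invalid subject" case can be checked directly. The following is a small sketch with a made-up input string; it assumes a PHP build with the standard PCRE extension, where preg_match() returns false on such input and preg_last_error() reports the reason.

```php
<?php
// The byte 0xFF can never occur in valid UTF-8, so this subject is invalid.
$invalid = "abc\xFF";

// Without /u the string is just bytes, and the ASCII prefix matches.
$withoutU = preg_match('/^[a-z]+/', $invalid);

// With /u the subject is validated as UTF-8 first. Validation fails, so the
// function returns false instead of 0 or 1.
$withU = preg_match('/^[a-z]+/u', $invalid);

// Must be read right after the failing call: it reports the last preg_* error.
$error = preg_last_error();

var_dump($withoutU);                      // int(1)
var_dump($withU);                         // bool(false)
var_dump($error === PREG_BAD_UTF8_ERROR); // bool(true)
```

This is one practical reason to be careful with /u on untrusted input: a strict comparison against false is needed to tell "no match" apart from "invalid UTF-8".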

You will need this when you have to match non-ASCII characters, such as Korean or Japanese text.

In other words, unless you are comparing strings that contain non-ASCII characters (anything beyond plain English text, for example), you don't need this flag.
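As a rough illustration of where the flag matters and where it does not, the sketch below runs the third pattern from the question against UTF-8 input and then counts with a quantified dot. The sample word and variable names are assumptions chosen for the example.

```php
<?php
// "báž" written with byte escapes: b, á (C3 A1), ž (C5 BE).
// That is 3 characters but 5 bytes in UTF-8.
$word = "b\xC3\xA1\xC5\xBE";
$wrapped = '+(' . $word . ')';

// ASCII-only delimiters around a greedy dot: the captured bytes are the same
// with and without /u, because only the ASCII "+(" and ")" are stripped.
$inner  = preg_replace('/^\+\((.*)\)$/',  '\1', $wrapped);
$innerU = preg_replace('/^\+\((.*)\)$/u', '\1', $wrapped);

// Counting is where the modes diverge: without /u the dot consumes one byte,
// with /u it consumes one character.
$threeBytes = preg_match('/^.{3}$/',  $word); // 0: the string has 5 bytes
$fiveBytes  = preg_match('/^.{5}$/',  $word); // 1
$threeChars = preg_match('/^.{3}$/u', $word); // 1: the string has 3 characters

var_dump($inner === $word, $innerU === $word);
var_dump($threeBytes, $fiveBytes, $threeChars);
```

So patterns that only anchor on ASCII bytes behave the same either way, while anything that counts, classifies, or splits individual characters needs /u to see characters rather than bytes.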

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 YJM