'What can NOT be described by a PCRE regex?

I am using a lot of regular expressions and stumbled over the question what actually can not be described by a regex.

First example that came to my mind was matching a string like XOOXXXOOOOXXXXX.... This would be a string consisting of an alternating sequence of X's and O's where each subpart consisting only of the character X or O is longer than the previsous sequence of the other character.

Can anybody explain what is the formal limit of a regex? I know this might be a rather academic question but I'm a curious person ;-)

Edit As I am a php guy I am especially interested in regex described by PCRE standard as described here: http://php.net/manual/en/reference.pcre.pattern.syntax.php I know that PCRE allows a lot of things that are not part of the original regular expressions like back references.

Mathcing of balanced parentheses seems to be one example that can not be matched by regular expressions in general but it can be matched using PCRE (see http://sandbox.onlinephpfunctions.com/code/fd12b580bb9ad7a19e226219d5146322a41c6e47 for live example):

$data = array('()', '(())', ')(', '(((()', '(((((((((())))))))))', '()()');    
$regex = '/^((?:[^()]|\((?1)\))*+)$/';

foreach($data as $d) {
  echo "$d matched by regex: " . (preg_match($regex, $d) ? 'yes' : 'no') . "\n";
}


Solution 1:[1]

First example that came to my mind was matching a string like XOOXXXOOOOXXXXX.... This would be a string consisting of an alternating sequence of X's and O's where each subpart consisting only of the character X or O is longer than the previsous sequence of the other character.

Yes, that can be done.


  1. To match a non-empty sequence of x's, followed by a greater number of o's, we can use an approach similar to that of the balanced-parentheses regex:

    (x(?1)?o)o+
    
  2. To match a string of x's and o's such that any sequence of x's is followed by a longer sequence of o's (except optionally at the very end), we can build on pattern #1:

    ^o*(?:(x(?1)?o)o+)*x*$
    
  3. And of course, we'll also need a variant of pattern #2 with x's and o's flipped:

    ^x*(?:(o(?1)?x)x+)*o*$
    
  4. To match a string of x's and o's that meet both of the above criteria, we can convert pattern #2 to a positive lookahead assertion, and renumber the capture-group in pattern #3:

    ^(?=o*(?:(x(?1)?o)o+)*x*$)x*(?:(o(?2)?x)x+)*o*$
    

As for the main question . . . I'm confident that a PCRE can match any context-free language, since the support for (?n) outside of the nth capture-group means that you can basically create a subroutine for each of your non-terminals. For example, this context-free grammar:

  • S ? aTb
  • S ? ?
  • T ? cSd
  • T ? eTf

can be written as:

  • capture-group #1 (S) ? (a(?2)b|)
  • capture-group #2 (T) ? (c(?1)d|e(?2)f)

To assemble that into a single regex, we can just concatenate them all, but appending {0} after all but the start non-terminal, and then add ^ and $:

^(a(?2)b|)(c(?1)d|e(?2)f){0}$

But as you saw from your first example, PCREs can match some non-context-free languages as well. (Another example is anbncn, which is a classic example of a non-context-free language. You can match it with PCRE by combining a PCRE for anbncm with a PCRE for ambncn using a forward lookahead assertion. Although the intersection of two regular languages is necessarily regular, the intersection of two context-free languages is not necessarily context-free; but the intersection of the languages defined by two PCREs can be defined by a PCRE.)

Solution 2:[2]

The set of all languages that can be recognized by a regular expression is called, not surprisingly, "regular languages".

The next most complicated languages are the context-free languages. They cannot be parsed by any regular expression. The standard example is "all balanced parentheses" -- so "()()" and "(())" but not "(()".

Another good example of a context-free language is HTML.

Solution 3:[3]

I don't have definitive evidence that any of these are impossible with things like recursion, balancing groups, self-referencing groups, and appending text to the string being tested. I would be glad to be proven wrong on any or all of these, as I would learn something!

  1. It's pretty bad at math.

For example, I do not believe it is possible using PCRE, to detect a sequence of numbers that is ascending: that is, to match "1 2 7 97 315 316..." but not "127 97 315 316..."

  1. I'm not sure it's possible to even match a sequence contiguously increasing from 1, like "1 2 3...", without exhaustively listing all possibilities like /1( 2( 3(...)?)?)?/ up to the max length you wish to check.

Thee are hacks to make it work by adding known text to the string under test (eg http://www.rexegg.com/regex-trick-line-numbers.html works by adding a series of numbers to the end of the file). But as raw regex, simple math is only possible by brute-forcing.

  1. Another example which I believe it will fail at is "match any sequence which sums to N".

So for N=4, it should match 4, 3 1, 1 3, 2 2, 1 1 1 1, 2 1 1, 1 2 1, 1 1 2, 1 1 1 1, which looks like you could brute-force it, until you realize it also has to match 4 -12 11 0 1.

  1. In the same manner, I don't think you could have it analyze an equation using SI units, and verify whether the units balanced on both sides of the equation. For example, "10N=2kg*5ms^-1". Never mind checking the values, just checking the units are correct.

  2. Then there're all the classes of problems that no computer program can currently accomplish, such as "check if a string has been correctly title cased in English" which would require a context-sensitive natural language parser to correctly detect the different senses of "like" in "Time Flies like an Arrow But Fruit Flies Like a Banana".

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3