Why is the non-greedy modifier ignored in my regex?

I am having problems with a non-greedy regex.

I need to capture a number (in a specific format) that follows a specific word (CPF), along with a couple of words (a name, always in uppercase) preceding it.

This is an example of my data:

NOT THIS NAME, blah blah, ETC ETC ETC, XXXX 00.000.000/0001-00, blah blah, 123, ZIP 12345-123, blah blah, blah TEXT TEXT TEXT TEXT, blah blah THIS NAME, blah blah, blah blah CPF 999.999.999-99 blah blah

My regex is (\b[A-Z\sÀÈÌÒÙÁÉÍÓÚÝÂÊÎÔÛÃÑÕÄËÏÖÜŸ]{4,}\b).*?CPF.*?(\d{3}.\d{3}.\d{3}.\d{2}), and while it captures the number after CPF without problems, it captures "NOT THIS NAME" instead of "THIS NAME".
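The behaviour can be reproduced outside regex101 with a short Python sketch. The variable names (text, original, m) are only illustrative, and Python's built-in re module is assumed here as the engine:

```python
import re

# Sample line from the data above.
text = (
    "NOT THIS NAME, blah blah, ETC ETC ETC, XXXX 00.000.000/0001-00, "
    "blah blah, 123, ZIP 12345-123, blah blah, blah TEXT TEXT TEXT TEXT, "
    "blah blah THIS NAME, blah blah, blah blah CPF 999.999.999-99 blah blah"
)

# The regex exactly as written in the question.
original = re.compile(
    r"(\b[A-Z\sÀÈÌÒÙÁÉÍÓÚÝÂÊÎÔÛÃÑÕÄËÏÖÜŸ]{4,}\b).*?CPF.*?(\d{3}.\d{3}.\d{3}.\d{2})"
)

m = original.search(text)
print(m.group(1))  # NOT THIS NAME  (the first uppercase run, not the nearest one)
print(m.group(2))  # 999.999.999-99
```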

I tried everything I knew (not much, I admit) at https://regex101.com/ without success.

Shouldn't the .*? before CPF match the minimum amount, thus capturing the uppercase words nearest to CPF?

Or does the regex start at the beginning of the line, match the first uppercase words it finds, and then let the non-greedy .*? consume everything up to (the first) CPF it finds? If so, is there a way to set CPF as the "starting point"?

Thanks in advance



Solution 1:[1]

You can use

\b(?!\s)([A-Z\sÀÈÌÒÙÁÉÍÓÚÝÂÊÎÔÛÃÑÕÄËÏÖÜŸ]{4,})\b[^A-ZÀÈÌÒÙÁÉÍÓÚÝÂÊÎÔÛÃÑÕÄËÏÖÜŸ]*CPF.*?(\d{3}.\d{3}.\d{3}.\d{2})

See the regex demo. Details:

  • \b - word boundary
  • (?!\s) - negative lookahead: the next char must not be whitespace
  • ([A-Z\sÀÈÌÒÙÁÉÍÓÚÝÂÊÎÔÛÃÑÕÄËÏÖÜŸ]{4,}) - Group 1: four or more chars that are either whitespace or uppercase letters from the specified set/range
  • \b - word boundary
  • [^A-ZÀÈÌÒÙÁÉÍÓÚÝÂÊÎÔÛÃÑÕÄËÏÖÜŸ]* - zero or more chars other than uppercase letters from the specified set/range
  • CPF - a fixed substring
  • .*? - any zero or more chars other than line break chars, as few as possible
  • (\d{3}.\d{3}.\d{3}.\d{2}) - Group 2: three digits, any char, three digits, any char, three digits, any char, two digits.
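As a quick check of the pattern above, here is a minimal Python sketch run against the sample line from the question. It assumes Python's built-in re module, and the names text, fixed and match are illustrative rather than part of the original answer:

```python
import re

# Sample line from the question.
text = (
    "NOT THIS NAME, blah blah, ETC ETC ETC, XXXX 00.000.000/0001-00, "
    "blah blah, 123, ZIP 12345-123, blah blah, blah TEXT TEXT TEXT TEXT, "
    "blah blah THIS NAME, blah blah, blah blah CPF 999.999.999-99 blah blah"
)

# The pattern from this answer: no uppercase letter may appear between
# the captured name and CPF, so earlier uppercase runs cannot match.
fixed = re.compile(
    r"\b(?!\s)([A-Z\sÀÈÌÒÙÁÉÍÓÚÝÂÊÎÔÛÃÑÕÄËÏÖÜŸ]{4,})\b"
    r"[^A-ZÀÈÌÒÙÁÉÍÓÚÝÂÊÎÔÛÃÑÕÄËÏÖÜŸ]*CPF.*?(\d{3}.\d{3}.\d{3}.\d{2})"
)

match = fixed.search(text)
print(match.group(1))  # THIS NAME
print(match.group(2))  # 999.999.999-99
```

The negated character class does the work here: the engine still tries start positions from left to right, but every attempt that begins at an earlier uppercase run fails because another uppercase letter shows up before CPF.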

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Wiktor Stribiżew