'Extract text up to the n-th character in a string, but return the whole string if the character isn't present

I was looking at this question and the accepted answer gives this as a solution for the case when there are fewer than n characters in the string:

^(([^>]*>){4}|.*)

However, I have done a fiddle here, and it shows that this regex will just simply return the entire string all of the time.

This code:

SELECT
  SUBSTRING(a FROM '^(([^>]*>){4}|.*)'),
  a,
  LENGTH(SUBSTRING(a FROM '^(([^>]*>){4}|.*)')),
  LENGTH(a),

  LENGTH(SUBSTRING(a FROM '^(([^>]*>){4}|.*)')) = LENGTH(a)
  
FROM s
WHERE   LENGTH(SUBSTRING(a FROM '^(([^>]*>){4}|.*)')) = LENGTH(a) IS false;

after several runs returns no records - meaning that the regex is doing nothing.

Question:

I would like a regex which returns up to the fourth > character (not including it) OR the entire string if the string only contains 3 or fewer > characters. RTRIM() can always be used to trim the final > if not including it is too tricky - having an answer which gives both possibilities would help me to deepen my understanding of regexes!

This is not a duplicate - it's certainly related, but I'd like to correct the error in the original answer - and provide a correct answer of my own.



Solution 1:[1]

You can repeat matching 0-3 times including the > using

^(?:[^>]*>){0,3}[^>]*
  • ^ Start of string
  • (?:[^>]*>){0,3} Repeat 0 - 3 times matching any character except > and then match >
  • [^>]* Optionally match any char except >

See a regex demo.

If there should be at least a single > then the quantifier can be {1,3}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1