'Regex to find repeated sentences from more to less

i have string like

$string = "hello this is a string and hello but this is a string hello but this is a string and ";

in it there is repeated words and repeated sentences but i only want the sentences so i expect

hello but this is a string to be captured

i tried using this regex (.{10,}).*?\1 but it got me this is a string and

but i want to get hello but this is a string because it is the most letters from 10+ without making it {25,} to match more only

but it is also very very slow


  • Cary Swoveland: my plan is to capture longest string repeated and remove it from the string and leaving only one so in my example it would be

hello this is a string and hello but this is a string and



Solution 1:[1]

I wish to suggest an algorithm to compute longest repeated substring of a given string that is comprised of words and is not preceded nor followed by a word character.

The approach is straightforward: condition on the word in the string that is the first word of the substring. I initially remove non-word characters at the beginning of the string, then find the longest repeating string that begins at the start of that string.

Next, regardless of whether a repeating string was found, a new string is formed by removing the first word of the modified string as well as following non-word characters. The process is repeated, at each step where a repeating string is found the length of that string being compared to the length of the previously longest-known repeating string.

This algorithm can be implemented in Ruby as follows. I realize that a PHP solution is desired, of course, but I do not know PHP. However, the Ruby code reads much like pseudo-code. Perhaps an inspired reader would be interested in converting it to PHP.

RGX = /\A(.*\w)\b(?=.*\b\1\b)/
def longest_repeating(str)
  longest = { str: '', len: 0 } # best solution known so far
  loop do                       # loop until breaking out by returning
    i = (str =~ /\w/)           # index of first word char if present
    return longest if i.nil?    # no more words to examine
    str = str[i..-1]            # remove first i characters  
    s = str[RGX]                # obtain string matched by RGX
    if s                        # match found
      n = s.length              # update longest if new longest found
      longest = { str: s, len: n } if n > longest[:len]
    end
    str = str[/ .*/]            # remove leading spaces from str
  end
end
longest_repeating "hello this is a string and hello but this is a string hello but this is a string and "
  #=> {:str=>"hello but this is a string", :len=>26}
longest_repeating "aaa bbb ccc ddd bbb ccc ddd eee fff fff ggg hhh iii jjj kkk lll eee fff fff ggg hhh iii jjj"           
  #=> {:str=>"eee fff fff ggg hhh iii jjj", :len=>27}

The regular expression held by RGX can be broken down as follows.

\A        # match beginning of string
(.*\w)    # match zero or more characters followed by a word
          # characters, save to capture group 1
\b        # match a word boundary
(?=       # begin a positive lookahead
  .*      # match zero or more characters
  \b\1\b  # match the content of capture group 1 with word boundaries
)

Note that the code ensures that the first character matched by (.*\w) is a word character.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1