'Regex to find repeated sentences from more to less
i have string like
$string = "hello this is a string and hello but this is a string hello but this is a string and ";
in it there is repeated words and repeated sentences but i only want the sentences so i expect
hello but this is a string to be captured
i tried using this regex (.{10,}).*?\1 but it got me this is a string and
but i want to get hello but this is a string because it is the most letters from 10+ without making it {25,} to match more only
but it is also very very slow
- Cary Swoveland: my plan is to capture longest string repeated and remove it from the string and leaving only one so in my example it would be
hello this is a string and hello but this is a string and
Solution 1:[1]
I wish to suggest an algorithm to compute longest repeated substring of a given string that is comprised of words and is not preceded nor followed by a word character.
The approach is straightforward: condition on the word in the string that is the first word of the substring. I initially remove non-word characters at the beginning of the string, then find the longest repeating string that begins at the start of that string.
Next, regardless of whether a repeating string was found, a new string is formed by removing the first word of the modified string as well as following non-word characters. The process is repeated, at each step where a repeating string is found the length of that string being compared to the length of the previously longest-known repeating string.
This algorithm can be implemented in Ruby as follows. I realize that a PHP solution is desired, of course, but I do not know PHP. However, the Ruby code reads much like pseudo-code. Perhaps an inspired reader would be interested in converting it to PHP.
RGX = /\A(.*\w)\b(?=.*\b\1\b)/
def longest_repeating(str)
longest = { str: '', len: 0 } # best solution known so far
loop do # loop until breaking out by returning
i = (str =~ /\w/) # index of first word char if present
return longest if i.nil? # no more words to examine
str = str[i..-1] # remove first i characters
s = str[RGX] # obtain string matched by RGX
if s # match found
n = s.length # update longest if new longest found
longest = { str: s, len: n } if n > longest[:len]
end
str = str[/ .*/] # remove leading spaces from str
end
end
longest_repeating "hello this is a string and hello but this is a string hello but this is a string and "
#=> {:str=>"hello but this is a string", :len=>26}
longest_repeating "aaa bbb ccc ddd bbb ccc ddd eee fff fff ggg hhh iii jjj kkk lll eee fff fff ggg hhh iii jjj"
#=> {:str=>"eee fff fff ggg hhh iii jjj", :len=>27}
The regular expression held by RGX can be broken down as follows.
\A # match beginning of string
(.*\w) # match zero or more characters followed by a word
# characters, save to capture group 1
\b # match a word boundary
(?= # begin a positive lookahead
.* # match zero or more characters
\b\1\b # match the content of capture group 1 with word boundaries
)
Note that the code ensures that the first character matched by (.*\w) is a word character.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
