Regex tweak to remove duplicate text
The regex (?s)(.{10,})(?=\1) is used to remove duplicated portions of text that are at least 10 characters long. It generally works well, but in the snippet linked below it misses the duplication of the phrase beginning with the words "Assisted in documenting application".
Any idea how to improve the regex so it will catch that duplication?
Here's the snippet: https://regex101.com/r/sjACIb/1
Solution 1:[1]
The phrase isn't repeated immediately. Another sentence separates it from its recurrence: -Responsible for researching and writing new content for Nexus' website.
You can add a non-capturing group to handle any characters between the two occurrences:
(.{10,})(?:.*)(?=\1)
Note that this will also match certification in ...LEED-CI certification (anticipating Gold Level certification).
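To see the difference between the two patterns, here is a minimal Python sketch. The sample string and the variable names are made up for illustration, and the third pattern (with the gap moved inside the lookahead) is a hypothetical variant that is not part of the answer above. It is included because the answer's pattern consumes the text between the two copies, so replacing each match with an empty string would delete that text as well.

```python
import re

# Pattern from the question: the copy must follow immediately.
ADJACENT = re.compile(r"(?s)(.{10,})(?=\1)")
# Pattern from the answer: any characters may sit between the two copies.
SEPARATED = re.compile(r"(.{10,})(?:.*)(?=\1)")
# Hypothetical variant (an assumption, not from the answer):
# the gap is matched inside the lookahead, so it is not consumed.
LOOKAHEAD_GAP = re.compile(r"(.{10,})(?=.*\1)")

sample = "hello world XYZ hello world"  # made-up test string

# The question's pattern finds nothing: the copies are not adjacent.
print(ADJACENT.search(sample))              # None

# The answer's pattern captures the phrase, but the whole match
# also includes the text in between.
match = SEPARATED.search(sample)
print(repr(match.group(1)))                 # 'hello world'
print(repr(match.group(0)))                 # 'hello world XYZ '

# Substituting with "" therefore removes the in-between text too.
print(repr(SEPARATED.sub("", sample)))      # 'hello world'

# With the gap inside the lookahead only the earlier copy is removed.
print(repr(LOOKAHEAD_GAP.sub("", sample)))  # ' XYZ hello world'
```

Whether the in-between text should survive depends on what the replacement is meant to do, so treat the variant as one option rather than a correction.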
Edit: if you want to stick with the single-line modifier, you'll have to specify that the repeating phrase itself must not contain newlines, in order to avoid catastrophic backtracking (newlines are then still allowed between a phrase and its recurrence):
(?s)([^\n]{10,})(?:.*)(?=\1)
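A short Python sketch of how this version behaves on multi-line input follows. The three-line sample is invented for illustration (loosely modelled on the phrases quoted above), and the names are hypothetical. It assumes Python's `re` module, where `(?s)` corresponds to the DOTALL flag.

```python
import re

# Single-line (DOTALL) version from the edit: the captured phrase may not
# contain a newline, but the gap before its recurrence may.
PATTERN = re.compile(r"(?s)([^\n]{10,})(?:.*)(?=\1)")
# Same idea without (?s): "." stops at newlines, so the gap cannot span lines.
NO_DOTALL = re.compile(r"(.{10,})(?:.*)(?=\1)")

# Made-up sample: the first line recurs two lines later.
lines = "\n".join([
    "Assisted in documenting application",
    "Responsible for researching and writing",
    "Assisted in documenting application",
])

found = PATTERN.search(lines)
print(found.group(1))           # Assisted in documenting application

# Without the modifier the recurrence on another line is not found.
print(NO_DOTALL.search(lines))  # None
```

Restricting the capture group to `[^\n]` keeps each candidate phrase within one line, which limits how many combinations the engine has to try before giving up.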
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
