Remove repeating characters from a sentence but retain the words' meaning

I want to remove repeating characters from a sentence while making sure each word still retains its meaning (if it has any). For example: "I'm so haaappppyyyy about offline school" should become "I'm so happy about offline school". Here haaappppyyyy became happy, while offline and school stay the same instead of becoming ofline and schol.

I've tried two solutions, using re and itertools, but neither really does what I'm looking for.

Using regex:

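import re

tweet = "I'm so haaappppyyyy about offline school"
# Cap every run of a repeated character at two occurrences
repeat_char = re.compile(r"(.)\1{1,}", re.IGNORECASE)
tweet = repeat_char.sub(r"\1\1", tweet)
# No run of three or more is left at this point, so this substitution changes nothing
tweet = re.sub(r"(.)\1{2,}", r"\1", tweet)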

Output:

I'm so haappyy about offline school  # every run of repeated chars is reduced to 2 chars

Using itertools:

import itertools
tweet = "I'm so happy about offline school"
# Keep a single character from each run of identical characters
tweet = ''.join(ch for ch, _ in itertools.groupby(tweet))

Output:

I'm so hapy about ofline schol

How can I fix this? Should I make a list of words I want to exclude?

In addition, I also want to reduce words that consist of a repeated pattern to their base form. For example:

wkwk (base form)
wkwkwkwk
wkwkwkwkwkwkwk

I want to turn the second and third words into the first word, the base form.



Solution 1:[1]

This answer was originally written for the question "Regex to reduce repeated chars in a string", which was closed as a duplicate before I could submit my post. So I "recycled" it here.


Regex is not always the best solution

Regex for validation of formats or input

A regex is often used for low-level pattern recognition and substitution. It may be useful for validating formats. You can see it as "dumb" automation.

Linguistics (NLP)

When it comes to natural language processing (NLP), or here spelling (dictionary lookup), semantics may play a role. Depending on the context, "ass" and "as" may both be correctly spelled, although their meanings are very different. (I apologize for the rude example, but I am not a native speaker and those two had the most distinct meanings depending on reduplication.)

For those cases a regex or simple pattern recognition may not be sufficient. Applying it correctly can take more effort than researching a language-specific library or solution (including a basic application).

Examples for spelling that a regex may struggle with

Consider the difference between "haappy" (orthographically invalid, but only because of the duplicated vowel "aa", not the consonants "pp"), "yeees" (whose correct spelling contains no duplicates), and "kiss" (correctly spelled with a duplicated consonant).
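To make the problem with these three words concrete, here is a minimal sketch of the naive approach: collapse every run of identical characters to a single character. The helper name collapse_runs is made up for this illustration.

```python
import re


def collapse_runs(word):
    """Collapse every run of identical characters to one character (naive rule)."""
    return re.sub(r"(.)\1+", r"\1", word)


for word in ["haappy", "yeees", "kiss"]:
    print(word, "->", collapse_runs(word))

# haappy -> hapy   (wrong: the "pp" should have been kept)
# yeees -> yes     (right)
# kiss -> kis      (wrong: the "ss" should have been kept)
```

The same rule fixes one word and breaks the other two, so the pattern alone cannot decide which duplicates are legitimate.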

Spelling correction requires more

For example, a dictionary to look up whether duplicate characters (vowels or consonants) are valid in the correct spelling of the word in its given form.

Consider a spelling-correction module

You could use the textblob module for spelling correction:

To install: pip install textblob

Example for some test-cases (independent words):

from textblob import TextBlob
 
incorrect_words = ["cmputr", "yeees", "haappy"]  # incorrect spelling

text = ",".join(incorrect_words)  # join them as comma separated list
print(f"original words: {text}")
 
b = TextBlob(text)

# prints the corrected spelling
print(f"corrected words: {b.correct()}")

Prints:

original words: cmputr,yeees,haappy
corrected words: computer,eyes,happy

Surprise: you might have expected "yes" (so did I). But the correction did not remove the two duplicated vowels "ee"; instead it rearranged the letters so that almost all of them are kept (4 of 5, only one "e" removed).

Example for the given sentence:

from textblob import TextBlob

tweet = "I'm so haaappppyyyy about offline school"  # either escape or use different quotes when a single-quote (') is enclosed
print(TextBlob(tweet).correct())

Prints:

I'm so haaappppyyyy about office school

Unfortunately, this result is considerably worse:

  • "haaappppyyyy" was not corrected to "happy"
  • "offline" was replaced by the semantically unrelated "office"

Apparently, a preceding cleaning step using regex, as Wiktor suggests, may improve the result.
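A minimal sketch of what such a cleaning step could look like, assuming that a run of three or more identical characters is never valid and can be capped at two. The helper name squeeze_runs and its max_run parameter are made up for this illustration; they are not part of textblob.

```python
import re

# A run of three or more identical characters
RUN = re.compile(r"(.)\1{2,}")


def squeeze_runs(text, max_run=2):
    """Shorten every run of 3+ identical characters to max_run characters."""
    return RUN.sub(lambda m: m.group(1) * max_run, text)


cleaned = squeeze_runs("I'm so haaappppyyyy about offline school")
print(cleaned)  # I'm so haappyy about offline school
```

Legitimate doubles such as the "ff" in "offline" and the "oo" in "school" are left alone. The cleaned text could then be passed to TextBlob(cleaned).correct(); whether that ends up at "happy" depends on the spelling model, which I have not verified here.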


Solution 2:[2]

Well, first of all you need a list (or set) of all allowed words, to compare with.

I'd approach it with the assumption (which might be wrong) that no word contains a run of more than two identical characters. So for each word, generate a list of all potential candidates; for example, "haaappppppyyyy" would yield ["haappyy", "happyy", "happy", etc.]. Then it's just a matter of checking which of those candidates actually exists by comparing against the allowed word list. The time complexity of this is quite high, though, so if it needs to go fast, throw a hash table on it or something :)
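A rough sketch of that idea, under the same assumption of at most two identical characters in a row. The tiny VOCAB set is a stand-in for a real word list, and the function names are made up for this illustration.

```python
import itertools

# Hypothetical allowed-word list; a real one would be loaded from a dictionary file.
# A set is hash-based, so each membership check is cheap.
VOCAB = {"I'm", "so", "happy", "about", "offline", "school"}


def candidates(word, max_run=2):
    """Return every variant of word where each run is cut to 1..max_run characters."""
    runs = [(ch, len(list(group))) for ch, group in itertools.groupby(word)]
    options = [
        [ch * n for n in range(1, min(count, max_run) + 1)]
        for ch, count in runs
    ]
    return ["".join(parts) for parts in itertools.product(*options)]


def normalize_word(word, vocabulary):
    """Return word if it is known, else the first known candidate, else word unchanged."""
    if word in vocabulary:
        return word
    for candidate in candidates(word):
        if candidate in vocabulary:
            return candidate
    return word


def normalize_sentence(sentence, vocabulary):
    """Normalize each whitespace-separated word independently."""
    return " ".join(normalize_word(w, vocabulary) for w in sentence.split())


print(normalize_sentence("I'm so haaappppyyyy about offline school", VOCAB))
# I'm so happy about offline school
```

The number of candidates doubles with every run of two or more characters, which is where the cost comes from. Also note that this returns the first match only: if several candidates are valid words (like "as" and "ass"), picking the right one needs more context than a word list can give.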

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Johan Kuylenstierna