Python Polars regex - remove non-English text, keep numbers, punctuation and emojis

I have Python code for the task:

import re
import string

emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
shrink_whitespace_reg = re.compile(r'\s{2,}')

def clean_text(raw_text):
    reg = re.compile(r'({})|[^a-zA-Z0-9 -{}]'.format(emoji_pat,r"\\".join(list(string.punctuation)))) # line a
    result = reg.sub(lambda x: ' {} '.format(x.group(1)) if x.group(1) else ' ', raw_text)
    return shrink_whitespace_reg.sub(' ', result).lower()

I tried to use the same pattern with Polars via polars.internals.series.StringNameSpace.contains,

but I got an exception:
ComputeError: regex error: Syntax(

regex parse error:
    ([🌀-🙏🚀-🛿☀-⛿✀-➿])|[^a-zA-Z0-9 -!\\"\\#\\$\\%\\&\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\@\\[\\\\\]\\^\\_\\`\\{\\}\\~]
                     ^^
error: unclosed character class

Examples with Chinese, English and unknown text:

import pandas as pd

texts = ['水虫対策にはコレが一番ですね','🙏🚀','I love polars!-ã„ã¤ã‚‚ã•らã•ら.','So good 👍.']
df = pd.DataFrame({'text':texts})

d = df.text.apply(clean_text)

Expected:

0                    
1                🙏 🚀 
2    i love polars! .
3         so good 👍 .
Name: text, dtype: object

Another question:

Is it faster than using re?



Solution 1:[1]

import string

import polars as pl

emoji_pat = "[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"

texts = ['水虫対策にはコレが一番ですね','🙏🚀','I |love|  polars!-ã„ã¤ã‚‚ã•らã•ら.','So [good]     👍  .']

df = pl.DataFrame(pl.Series("text", texts))

In [78]: df
Out[78]:
shape: (4, 1)
┌─────────────────────────────────────┐
│ text                                │
│ ---                                 │
│ str                                 │
╞═════════════════════════════════════╡
│ 水虫対策にはコレが一番ですね        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 🙏🚀                                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ I |love|  polars!-ã„ã¤ã‚‚ã•らã•... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ So [good]     👍  .                 │
└─────────────────────────────────────┘

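# Add cleaned column (the Rust regex engine requires "[" inside [] to be escaped).
# The backslash already in string.punctuation sits right before "]" and escapes it.
df_cleaned = df.with_columns(
    pl.col("text").str.replace_all(
        # Drop everything that is not ASCII alphanumeric, space, punctuation or emoji.
        "[^a-zA-Z0-9 " + string.punctuation.replace("[", r"\[") + emoji_pat + "]+",
        ""
    ).str.replace_all(
        # Collapse runs of whitespace into a single space.
        r"\s{2,}", " "
    ).str.to_lowercase().alias("text_cleaned")
)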
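
The escaping rule in the comment above can be explored without Polars. The sketch below is an illustration only: it uses Python's re module rather than the Rust engine, and kept_punctuation is a hypothetical helper name. It compares escaping just "[" by hand with passing string.punctuation through re.escape. With the hand-escaped version, the backslash that is already part of string.punctuation lands directly before "]" and acts as its escape, so a literal backslash is not in the allowed set and would be stripped from the text.

```python
import re
import string

# The answer's approach: only "[" is escaped by hand. The backslash from
# string.punctuation then precedes "]" and is consumed as its escape.
manual = string.punctuation.replace("[", r"\[")

# Alternative: let re.escape put a backslash before every metacharacter.
escaped = re.escape(string.punctuation)


def kept_punctuation(class_body: str) -> str:
    """Return the punctuation characters that survive the negated class."""
    pattern = re.compile("[^a-zA-Z0-9 " + class_body + "]+")
    return pattern.sub("", string.punctuation)


print(kept_punctuation(manual))   # every punctuation character except the backslash
print(kept_punctuation(escaped))  # all of string.punctuation
```

Whether the re.escape output is accepted unchanged by the regex engine behind Polars is an assumption here, so it is worth checking against the Polars version in use.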

In[79]: df_cleaned
Out[79]:
shape: (4, 2)
????????????????????????????????????????????????????????????
? text                                ? text_cleaned       ?
? ---                                 ? ---                ?
? str                                 ? str                ?
????????????????????????????????????????????????????????????
? ????????????????        ?                    ?
????????????????????????????????????????????????????????????
? ??                                ? ??               ?
????????????????????????????????????????????????????????????
? I |love|  polars!-ã„ã¤ã‚‚ã•らã•... ? i |love| polars!-. ?
????????????????????????????????????????????????????????????
? So [good]     ?  .                 ? so [good] ? .     ?
????????????????????????????????????????????????????????????
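
To sanity-check the text_cleaned column, the same two substitutions can be reproduced in plain Python. This is a minimal sketch, not the answer's code: clean_text_reference is a hypothetical name, the sample strings are shortened stand-ins for the ones above, and because Python's re has no nested character classes the emoji ranges are written without their own brackets.

```python
import re
import string

# Emoji ranges from the question, without surrounding brackets (assumption:
# they are merged directly into one character class for Python's re).
EMOJI_RANGES = "\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF"

# Same hand escaping as in the Polars expression: only "[" is escaped.
PUNCT_CLASS = string.punctuation.replace("[", r"\[")

DISALLOWED = re.compile("[^a-zA-Z0-9 " + PUNCT_CLASS + EMOJI_RANGES + "]+")
MULTI_SPACE = re.compile(r"\s{2,}")


def clean_text_reference(raw_text: str) -> str:
    """Remove disallowed characters, collapse whitespace runs, lowercase."""
    without_other = DISALLOWED.sub("", raw_text)
    return MULTI_SPACE.sub(" ", without_other).lower()


samples = ["水虫対策", "🙏🚀", "I |love|  polars!-いつも.", "So [good]     👍  ."]
for sample in samples:
    print(repr(clean_text_reference(sample)))
```

Unlike the question's clean_text, neither this sketch nor the Polars expression pads emojis with spaces, so '🙏🚀' stays '🙏🚀' instead of becoming ' 🙏 🚀 '.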

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ghuls