Matching Windows-1251 encoding character set in RegEx

I need to create a regular expression that matches only the characters NOT in the Windows-1251 character set, so I can detect whether a given piece of text contains any characters the encoding cannot represent. I tried the expression [^\u0000-\u044F]+, but it also matches some characters that are valid in Windows-1251.

I would appreciate any help with this.



Solution 1:[1]

No language was specified, but in Python there is no need for a regex; sets are enough. Build the set of all Unicode code points that Windows-1251 can represent and subtract it from the set of characters in the text. Note that byte 98h is the only byte not used in the Windows-1251 encoding:

>>> # Create the set of characters in code page 1251
>>> cp1251 = set(bytes(range(256)).decode('cp1251',errors='ignore'))
>>> set('This is a test \x98 马') - cp1251
{'\x98', '马'}
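The same set can also be used to see why the range from the question misbehaves. The following is a minimal sketch, not part of the original answer; the names question_pattern, false_positives and false_negatives are chosen here for illustration. It checks the question's pattern against the set in both directions: characters Windows-1251 can represent but the pattern flags, and code points inside the range that Windows-1251 cannot represent.

```python
import re

# Same set as above: every character Windows-1251 can represent.
cp1251 = set(bytes(range(256)).decode('cp1251', errors='ignore'))

# The pattern from the question.
question_pattern = re.compile(r'[^\u0000-\u044F]+')

# Valid Windows-1251 characters that the pattern flags anyway,
# because they sit above U+044F (for example U+0451, U+2116, U+20AC).
false_positives = sorted(c for c in cp1251 if question_pattern.match(c))

# Code points inside U+0000..U+044F that Windows-1251 cannot represent
# (for example U+0098 and most of Latin-1); the pattern lets them through.
false_negatives = [chr(cp) for cp in range(0x0450) if chr(cp) not in cp1251]

print(len(false_positives), len(false_negatives))
```

Both lists come out non-empty, so a single contiguous range cannot describe the code page. The characters have to be enumerated individually or in several smaller ranges.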

As a regular expression:

>>> import re
>>> text = ''.join(sorted(cp1251)) # string of all Windows-1251 code points from the previous set, in code point order
>>> text
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xa0¤¦§©«¬\xad®°±µ¶·»ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяёђѓєѕіїјљњћќўџҐґ–—‘’‚“”„†‡•…‰‹›€№™'
>>> not_cp1251 = re.compile(r'[^\x00-\x7f\xa0\xa4\xa6\xa7\xa9\xab-\xae\xb0\xb1\xb5-\xb7\xbb\u0401-\u040c\u040e-\u044f\u0451-\u045c\u045e\u045f\u0490\u0491\u2013\u2014\u2018-\u201a\u201c-\u201e\u2020-\u2022\u2026\u2030\u2039\u203a\u20ac\u2116\u2122]')
>>> not_cp1251.findall(text) # all cp1251 text finds no outliers
[]
>>> not_cp1251.findall(text+'\x98') # adding known outlier
['\x98']
>>> not_cp1251.findall('马克'+text+'\x98') # adding other outliers
['马', '克', '\x98']
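The hand-written character class above is easy to get wrong when typed manually. As a hedged sketch (the helper names cp1251_code_points, to_ranges and build_not_cp1251_pattern are made up for this example, not taken from the answer), the same negated class can be generated from the codec itself by collapsing the sorted code points into ranges:

```python
import re

def cp1251_code_points():
    """Sorted code points of every character Python's cp1251 codec can decode."""
    # errors='ignore' drops the one undefined byte (0x98).
    decoded = bytes(range(256)).decode('cp1251', errors='ignore')
    return sorted(ord(c) for c in decoded)

def to_ranges(points):
    """Collapse sorted code points into inclusive [start, end] runs."""
    ranges = []
    for p in points:
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1][1] = p          # extend the current run
        else:
            ranges.append([p, p])      # start a new run
    return ranges

def build_not_cp1251_pattern():
    """Compile a negated character class covering all Windows-1251 characters."""
    parts = []
    for start, end in to_ranges(cp1251_code_points()):
        if start == end:
            parts.append('\\u%04x' % start)
        else:
            parts.append('\\u%04x-\\u%04x' % (start, end))
    return re.compile('[^' + ''.join(parts) + ']')

not_cp1251_generated = build_not_cp1251_pattern()
print(not_cp1251_generated.pattern)
```

Writing every code point as a \uXXXX escape avoids having to escape characters such as ], ^ or - inside the class. This works here because all Windows-1251 code points fit in four hex digits.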

Solution 2:[2]

I'm still unsure why my code above wouldn't work, but I found the following workaround to be effective.

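import boto3
from pyspark.sql import functions as F

SUPPORTED_LANGUAGES = ('en', 'es', 'fr', 'de')

def detect_sentiment(text, language):
    # Only call Comprehend for the languages handled here; other rows get None.
    if language not in SUPPORTED_LANGUAGES:
        return None
    comprehend = boto3.client(service_name='comprehend', region_name='us-west-2')
    return comprehend.detect_sentiment(Text=text, LanguageCode=language)

# Wrap the function as a Spark UDF (the default return type is a string).
detect_sentiment_udf = F.udf(detect_sentiment)

# reviews_3 is a DataFrame defined elsewhere (not shown) with SENTENCE and LANGUAGE columns.
reviews_4 = reviews_3.withColumn('RAW_SENTIMENT_SCORE', detect_sentiment_udf('SENTENCE', 'LANGUAGE'))

reviews_4.show(50)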

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 user3242036