Regex for words that do not contain Cyrillic letters

I'd like to remove from a string every word that does not contain at least one Cyrillic letter (by "word" I mean a part of the string delimited by whitespace).

I've tried line = re.sub(' *^[^а-яА-Я]+ *', ' ', line), where [а-яА-Я] is the set of Cyrillic letters, but when processing the string

 des поместья, de la famille Buonaparte. Non, je vous pr&#233;viens que si vous

it returns

поместья, de la famille Buonaparte. Non, je vous pr&#233;viens que si vous

instead of just

поместья

Solution 1:^[1]

You want to keep any non-whitespace chunks that contain at least one Cyrillic char in them.

You can str.split() the string and use unicodedata to check if at least one char is Cyrillic, and only keep those "words":

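import unicodedata as ud
text = ' des поместья, de la famille Buonaparte. Non, je vous pr&#233;viens que si vous'
# keep only the whitespace-delimited chunks that contain a Cyrillic char
result = ' '.join([word for word in text.split() if any('CYRILLIC' in ud.name(c) for c in word)])
print(result) # => поместья,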
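
One caveat with the snippet above: ud.name(c) raises ValueError for characters that have no Unicode name (some control characters, for example). Below is a minimal sketch that passes a default to ud.name to avoid that. The helper name has_cyrillic and the sample text with an embedded BEL character are illustrative assumptions, not part of the original answer:

```python
import unicodedata as ud

def has_cyrillic(word):
    # hypothetical helper: ud.name(c, '') returns '' instead of raising
    # ValueError when a character has no Unicode name
    return any('CYRILLIC' in ud.name(c, '') for c in word)

# illustrative input; '\x07' (BEL) has no Unicode name and is not whitespace
text = 'des\x07 поместья, de la famille'
result = ' '.join(word for word in text.split() if has_cyrillic(word))
print(result)  # => поместья,
```

Whether this matters depends on your input; for clean text the one-liner behaves the same.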

If you also need to strip punctuation, use any of the solutions from "Best way to strip punctuation from a string":

import string
result = ' '.join([word.translate(str.maketrans('', '', string.punctuation)) for word in text.split() if any('CYRILLIC' in ud.name(c) for c in word)])
print(result) # => поместья

See the Python demo online. Details:

- [word.translate(str.maketrans('', '', string.punctuation)) for word in text.split() if any('CYRILLIC' in ud.name(c) for c in word)] - a list comprehension in which:
  - text.split() - splits the text into non-whitespace chunks
  - if any('CYRILLIC' in ud.name(c) for c in word) - a condition that checks whether the word contains at least one Cyrillic char
  - word.translate(str.maketrans('', '', string.punctuation)) - takes the word if the condition above is True and strips punctuation from it
- ' '.join(...) - joins the list items into a single space-separated string.
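
Since the question asked for a regex, here is a rough regex-only sketch of the same idea. The attempt in the question uses ^, which (without re.MULTILINE) matches only at the start of the string, so only the leading chunk gets removed. Rather than deleting the unwanted words, this sketch extracts the wanted ones. It assumes that "Cyrillic" means the Unicode block U+0400 to U+04FF (the class [а-яА-Я] from the question does not include ё/Ё); the names CYRILLIC_WORD and keep_cyrillic_words are hypothetical:

```python
import re

# Assumption: a "Cyrillic letter" is any char in the block U+0400..U+04FF.
# The pattern matches a run of non-whitespace chars with at least one such char.
CYRILLIC_WORD = re.compile(r'\S*[\u0400-\u04FF]\S*')

def keep_cyrillic_words(text):
    # hypothetical helper: collect the matching chunks, rejoin with single spaces
    return ' '.join(CYRILLIC_WORD.findall(text))

sample = ' des поместья, de la famille Buonaparte.'
print(keep_cyrillic_words(sample))  # => поместья,
```

As with the first snippet, punctuation attached to a word is kept; strip it afterwards if needed.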

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
