How to remove or filter non-English (Chinese, Korean, Japanese, Arabic) strings in a list?
Here is an input example:
['ARTA Travel Group', 'Arta | آرتا', 'ARTAS™ Practice Development', 'ArtBinder', 'Arte Arac Takip App', 'アート建築', 'Arte Brasil Bar & Grill', 'ArtPod Stage', 'Artpollo扫码', 'Artpollo阿波罗-价值最优的艺术品投资电商', '아트홀']
From a list like the one above, I want to remove the elements that contain Chinese, Korean, Japanese, or Arabic characters.
Below is the expected output (English only):
['ARTA Travel Group', 'ARTAS™ Practice Development', 'ArtBinder', 'Arte Arac Takip App', 'Arte Brasil Bar & Grill', 'ArtPod Stage']
Solution 1:[1]
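You can use a regex that searches by Unicode range. The pattern below rejects any string containing a character outside U+0000 to U+05C0 or U+2100 to U+214F. ™ belongs to the Letterlike Symbols block, which spans U+2100 to U+214F; you can either include the whole block or pick just the specific characters you need.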
import re
s = ['ARTA Travel Group', 'Arta | آرتا', 'ARTAS™ Practice Development', 'ArtBinder', 'Arte Arac Takip App', 'アート建築', 'Arte Brasil Bar & Grill', 'ArtPod Stage', 'Artpollo扫码', 'Artpollo阿波罗-价值最优的艺术品投资电商', '아트홀']
result = [i for i in s if not re.search("[^\u0000-\u05C0\u2100-\u214F]", i)]
print(result)
['ARTA Travel Group', 'ARTAS™ Practice Development', 'ArtBinder', 'Arte Arac Takip App', 'Arte Brasil Bar & Grill', 'ArtPod Stage']
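As a complementary sketch, the same filtering can be done with a blocklist instead of an allowlist: reject a string only if it contains a character from one of the scripts named in the question. The ranges below cover only the main Unicode blocks for Arabic, Hiragana, Katakana, CJK Unified Ideographs, and Hangul Syllables, so extension blocks would need to be added for full coverage. The names UNWANTED and keep_english are made up for this example.

```python
import re

# Main Unicode blocks for the scripts named in the question (not exhaustive).
UNWANTED = re.compile(
    "["
    "\u0600-\u06FF"  # Arabic
    "\u3040-\u309F"  # Hiragana
    "\u30A0-\u30FF"  # Katakana
    "\u4E00-\u9FFF"  # CJK Unified Ideographs
    "\uAC00-\uD7AF"  # Hangul Syllables
    "]"
)


def keep_english(strings):
    """Return the strings that contain no character from the blocked ranges."""
    return [s for s in strings if not UNWANTED.search(s)]


names = ['ARTA Travel Group', 'Arta | آرتا', 'ARTAS™ Practice Development',
         'ArtBinder', 'Arte Arac Takip App', 'アート建築',
         'Arte Brasil Bar & Grill', 'ArtPod Stage', 'Artpollo扫码',
         'Artpollo阿波罗-价值最优的艺术品投资电商', '아트홀']
print(keep_english(names))
```

A blocklist keeps symbols such as ™ and & without listing them, at the cost of letting through any script that is not explicitly blocked.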
Solution 2:[2]
Sorry, I can't comment on this post because of the reputation lock, so I'm posting here instead.
This question is already answered in the Stack Overflow question "Detect strings with non English characters in Python".
Hope this helps!
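For reference, one common way to do this kind of check on Python versions older than 3.7 is to try encoding the string as ASCII and treat a failure as "non-English". This is only a sketch of that general idea, not necessarily what the linked answer recommends, and the function name is_ascii_only is made up here. Like any pure-ASCII test, it also rejects strings that contain symbols such as ™.

```python
def is_ascii_only(s):
    """Return True if every character in s can be encoded as ASCII."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Shortened sample taken from the question's input list.
names = ['ARTA Travel Group', 'Arta | آرتا', 'ARTAS™ Practice Development',
         'アート建築', '아트홀']
print([n for n in names if is_ascii_only(n)])
```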
Solution 3:[3]
You can remove non-English strings from the list using the string method isascii(), which was introduced in Python 3.7, so you need Python 3.7 or later. Note that isascii() is strict: any non-ASCII character, including ™, makes it return False, so 'ARTAS™ Practice Development' is dropped as well.
def isEnglish(s):
    return s.isascii()
names = ['ARTA Travel Group', 'Arta | آرتا', 'ARTAS™ Practice Development', 'ArtBinder', 'Arte Arac Takip App', 'アート建築', 'Arte Brasil Bar & Grill', 'ArtPod Stage', 'Artpollo扫码', 'Artpollo阿波罗-价值最优的艺术品投资电商', '아트홀']
print(isEnglish("Test"))
print([name for name in names if isEnglish(name)])
Output:
True
['ARTA Travel Group', 'ArtBinder', 'Arte Arac Takip App', 'Arte Brasil Bar & Grill', 'ArtPod Stage']
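Because isascii() returns False for any string containing a non-ASCII character, including ™, one possible variation is to test character by character and allow a small set of extra symbols. The names EXTRA_ALLOWED and is_mostly_ascii are hypothetical, and which symbols to allow is an assumption you would adjust to your own data.

```python
# Non-ASCII symbols we still want to accept (assumption: only the trademark sign).
EXTRA_ALLOWED = {"™"}


def is_mostly_ascii(s, extra=EXTRA_ALLOWED):
    """Return True if every character is ASCII or explicitly allowed."""
    return all(ch.isascii() or ch in extra for ch in s)


names = ['ARTA Travel Group', 'Arta | آرتا', 'ARTAS™ Practice Development',
         'ArtBinder', 'Arte Arac Takip App', 'アート建築',
         'Arte Brasil Bar & Grill', 'ArtPod Stage', 'Artpollo扫码',
         'Artpollo阿波罗-价值最优的艺术品投资电商', '아트홀']
print([n for n in names if is_mostly_ascii(n)])
```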
Solution 4:[4]
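Use a regular expression and put the characters you want to allow into the character class. Match with re.fullmatch so that the whole string has to consist of allowed characters; re.search would accept any string that merely contains one English letter, such as 'Artpollo扫码'.
import re
c = ["ab cde", "test", "아트홀"]  # the last item is a non-English sample from the question
b = list(filter(lambda x: re.fullmatch(r"[a-zA-Z\s]+", x) is not None, c))
print(b)  # ['ab cde', 'test']
This is just to give you the idea; extend the character class (for example with digits, & or ™) to fit your data.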
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Henry Yik |
| Solution 2 | mallocation |
| Solution 3 | Indu |
| Solution 4 | Zhd Zilin |
