Remove capital words in Greek
I am working on an NLP task and I want to remove from my dataset the words that are written entirely in capital letters.
For example:
Input: 'Ο ΚΩΣΤΑΣ. Θελει να ΠΑΙΞΕΙ ΑΎΡΙΟ ποδόσφαιρο στο ΣΧολείο.'
Output: 'Θελει να ποδόσφαιρο στο ΣΧολείο.'
I have found the following function, but it doesn't work exactly as I want for my problem. Is there any way to adapt it?
import unicodedata

# Remove the titles of the texts
def remove_titles(text):
    greek_capital_chars = set(chr(cp) for cp in range(0x0370, 0x1FFF) if "GREEK CAPITAL" in unicodedata.name(chr(cp), ""))
    s = text.split('.')
    s = [i for i in s if not all([k in greek_capital_chars for k in i if k!=' '])]
    return '.'.join(s)
Input: 'Ο ΚΩΣΤΑΣ. Θελει να ΠΑΙΞΕΙ ΑΎΡΙΟ ποδόσφαιρο στο ΣΧολείο.'
Output: 'Θελει να ΠΑΙΞΕΙ ΑΎΡΙΟ ποδόσφαιρο στο ΣΧολείο'
So it can't remove the capital words in the middle of a sentence, and the dot at the end is also removed.
Update: I changed my function to this, but I still don't get what I want. See the following example:
def remove_sent_capital(x):
    greek_capital_chars = set(chr(cp) for cp in range(0x0370, 0x1FFF) if "GREEK CAPITAL" in unicodedata.name(chr(cp), ""))
    s = x.split(' ')
    s = [i for i in s if not all([k in greek_capital_chars for k in i if k!=' '])]
    return ' '.join(s)
Input: 'Ο ΚΩΣΤΑΣ. Θελει να ΠΑΙΞΕΙ. ΑΎΡΙΟ ποδόσφαιρο στο ΣΧολείο.'
Output: 'ΚΩΣΤΑΣ. Θελει να ΠΑΙΞΕΙ. ποδόσφαιρο στο ΣΧολείο.'
Output that I want: 'Θελει να ποδόσφαιρο στο ΣΧολείο.'
Solution 1:[1]
You are splitting the text sentence by sentence (text.split('.')).
So, as is, your code removes sentences that contain only Greek capitals, not words.
In the example you give, only the first sentence is fully capitalised, so the second is kept.
As for the missing dot: only one sentence (the second one) survives the filter, so join has nothing to put a separator between and no dot is added.
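To see where the dot goes, it can help to look at what split and join do on the example. The snippet below is only an illustration of that behaviour; the variable names are arbitrary.

```python
text = "Ο ΚΩΣΤΑΣ. Θελει να ΠΑΙΞΕΙ ΑΎΡΙΟ ποδόσφαιρο στο ΣΧολείο."

# split('.') cuts at every dot, so the trailing dot leaves an empty last piece.
parts = text.split(".")
print(len(parts))        # 3
print(repr(parts[-1]))   # ''

# all() over an empty sequence is True, so the empty piece is filtered out
# together with the fully capitalised first sentence.
print(all([]))           # True

# join only inserts the separator between elements; with a single element
# left there is nothing to separate, so no dot appears in the result.
print(".".join([parts[1]]) == parts[1])  # True
```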
What you actually want to do is:
- split the text into sentences
- then split each sentence into words (words = s.split() should do the job; by default Python splits on whitespace)
- then apply your filter to the list of words
- you then only need to reconstruct the sentence (" ".join(words) will do)
- and finally add a dot to the end of each sentence and concatenate them.
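The steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the names remove_capital_words and is_all_caps are made up for illustration, and it assumes that sentences are delimited by '.' only and that a word should be dropped when every one of its characters is a Greek capital letter (the same character set as in the question).

```python
import unicodedata

# Same set as in the question: every code point in the range whose
# Unicode name contains "GREEK CAPITAL".
GREEK_CAPITAL_CHARS = {
    chr(cp)
    for cp in range(0x0370, 0x1FFF)
    if "GREEK CAPITAL" in unicodedata.name(chr(cp), "")
}


def is_all_caps(word):
    # Hypothetical helper: True for a non-empty word made only of Greek capitals.
    return bool(word) and all(ch in GREEK_CAPITAL_CHARS for ch in word)


def remove_capital_words(text):
    kept_sentences = []
    for sentence in text.split("."):        # split the text into sentences
        words = sentence.split()            # split the sentence into words
        words = [w for w in words if not is_all_caps(w)]  # apply the filter
        if words:                           # skip sentences that end up empty
            # rebuild the sentence and put its dot back
            kept_sentences.append(" ".join(words) + ".")
    return " ".join(kept_sentences)         # concatenate the sentences


print(remove_capital_words("Ο ΚΩΣΤΑΣ. Θελει να ΠΑΙΞΕΙ ΑΎΡΙΟ ποδόσφαιρο στο ΣΧολείο."))
# Θελει να ποδόσφαιρο στο ΣΧολείο.
```

Because a dot is appended to every sentence that keeps at least one word, the input from the update ('Ο ΚΩΣΤΑΣ. Θελει να ΠΑΙΞΕΙ. ΑΎΡΙΟ ποδόσφαιρο στο ΣΧολείο.') gives 'Θελει να. ποδόσφαιrο στο ΣΧολείο.'-style output with a dot after 'να'; if that dot should disappear along with the removed word, the filter could be applied to the whole text at once after stripping punctuation from each word before the check.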
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
