Twitter data analysis
I have a question for my thesis project. To run a sentiment analysis, I would like to eliminate all hashtags from the text, but the Python code below removes only the "#" character. I would also like to remove the word attached to the "#". Thanks, everyone.
df['text'] = df['text'].apply(lambda x: ' '.join(re.findall(r'\w+', x)))
Solution 1:[1]
Assuming you want the rest of the words after the hashtag to remain intact, try this:
import re
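df['text'] = df['text'].apply(lambda x: re.sub(r"#\S+", "", x))
This removes each "#" together with the characters that follow it, up to the next whitespace character. The rest of the text is left untouched, so a removed hashtag can leave a double space behind.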
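As a minimal sketch of the same idea on plain strings, the helper below removes the hashtags and then collapses the leftover whitespace. The name strip_hashtags and the sample tweet are made up for illustration, and the whitespace cleanup is an extra step that the answer above does not include.

```python
import re

# Hypothetical helper, not part of the original answer.
HASHTAG = re.compile(r"#\S+")

def strip_hashtags(text: str) -> str:
    """Remove each hashtag (the '#' plus the word attached to it)."""
    without_tags = HASHTAG.sub("", text)
    # Collapse the runs of whitespace left behind by the removed hashtags.
    return " ".join(without_tags.split())

print(strip_hashtags("Loving the #sunny weather in #Rome today"))
# prints: Loving the weather in today
```

Assuming df['text'] holds strings, the helper can be passed straight to apply: df['text'] = df['text'].apply(strip_hashtags).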
Solution 2:[2]
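You can use the re.sub function. Something like this:
df["text"] = df["text"].apply(lambda x: re.sub(r"#\S+\s*", "", x))
This replaces everything that matches the pattern "#\S+\s*" (a hashtag, the non-whitespace characters that follow it, and any trailing whitespace) with an empty string. Avoid a greedy pattern such as "#.*\s": it matches from the first "#" to the last whitespace character on the line, which deletes the ordinary words in between, and it misses a hashtag at the very end of the text. You may need to tweak the regex a bit depending on your data.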
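To see how much the quantifier matters, here is a small sketch that runs a greedy pattern, "#.*\s", and a bounded one, "#\S+\s*", over the same made-up tweet. The sample text and variable names are assumptions chosen for illustration.

```python
import re

# Made-up sample text with two hashtags and ordinary words between them.
sample = "great #game tonight with #friends"

# Greedy: ".*" runs from the first "#" to the last whitespace on the line.
greedy = re.sub(r"#.*\s", "", sample)

# Bounded: "\S+" stops at the first whitespace after each "#".
bounded = re.sub(r"#\S+\s*", "", sample)

print(repr(greedy))   # 'great #friends'
print(repr(bounded))  # 'great tonight with '
```

The greedy version drops "tonight with" and keeps the last hashtag, while the bounded version removes only the two hashtags. A final .strip() would remove the trailing space if that matters for the analysis.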
Check the documentation about the re module here: https://docs.python.org/3/library/re.html
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tanuj Wadhi |
| Solution 2 | Liutprand |
