Twitter data analysis

I have a question for my thesis project. To run a sentiment analysis, I would like to remove all hashtags, but the Python code below removes only the "#" character. I would also like to remove the word attached to the "#". Thanks, everyone.

df['text']=df['text'].apply(lambda x:' '.join(re.findall(r'\w+', x)))



Solution 1:[1]

Assuming you want the rest of the words after the hashtag to remain intact, try this:

import re
df['text'] = df['text'].apply(lambda x: re.sub(r"#\S+", '', x))

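It removes the "#" together with every character that follows it, up to the next whitespace. The rest of the text is left untouched, including the spaces that surrounded the hashtag.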
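To see what that pattern does without a DataFrame, here is a minimal sketch on a plain string. The helper name strip_hashtags and the sample tweet are made up for illustration; only the regex comes from the answer above.

```python
import re

# Hypothetical helper; the pattern is the one used in the answer above.
HASHTAG = re.compile(r"#\S+")

def strip_hashtags(text):
    # Drop "#" plus every non-whitespace character that follows it.
    return HASHTAG.sub("", text)

sample = "Loving the new phone #tech #gadgets today"  # invented example tweet
cleaned = strip_hashtags(sample)

# The spaces around each removed hashtag remain, so normalise whitespace.
cleaned = " ".join(cleaned.split())
print(cleaned)
```

With a DataFrame, the same function could presumably be passed straight to df['text'].apply(strip_hashtags).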

Solution 2:[2]

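You can use the re.sub function. Something like this:

df["text"] = df["text"].apply(lambda x: re.sub(r"#\S+\s?", "", x))

This replaces everything that matches the pattern "#\S+\s?" (a hashtag, followed by one or more non-whitespace characters, followed by an optional whitespace character) with an empty string. Matching \S+ rather than .* matters here: .* is greedy and would run on to the last whitespace in the tweet, deleting ordinary words between two hashtags. You may need to tweak the regex a bit depending on your data.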
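As one example of such a tweak, the sketch below assumes you want to keep punctuation that touches a hashtag (exclamation marks can matter for sentiment). It matches only word characters after the "#" and then collapses the leftover whitespace. The function name remove_hashtags and the sample text are hypothetical.

```python
import re

def remove_hashtags(text):
    # "#\w+" stops at the first non-word character, so punctuation
    # directly after a hashtag (e.g. "#blessed!!!") is preserved.
    without_tags = re.sub(r"#\w+", "", text)
    # Collapse the gaps left behind and trim both ends.
    return re.sub(r"\s+", " ", without_tags).strip()

tweet = "Best day ever #blessed!!! #nofilter"  # invented example tweet
print(remove_hashtags(tweet))
```

Whether \w+ or \S+ is the better choice depends on how hashtags appear in your tweets, so it is worth checking a handful of rows by eye before cleaning the whole column.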

Check the documentation about the re module here: https://docs.python.org/3/library/re.html

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  Tanuj Wadhi
Solution 2  Liutprand