Cleaning up tweets before sentiment analysis on cryptocurrencies

I am trying to analyse Twitter sentiment. Right now I have code which scrapes tweets from Twitter with the API and writes them to an Excel file along with their sentiment score. However, I want to clean these tweets before they are written to the Excel file and before they are analysed with the Google Cloud NLP API.

Below is the code that scrapes the tweets:

    while tweetCount < maxTweets:
    if(maxId <= 0):
            newTweets = api.search_tweets(q=hashtag, count=tweetsPerQry, lang="en",
    result_type="recent", tweet_mode="extended")
    else:
                newTweets =  api.search_tweets(q=hashtag, count=tweetsPerQry, lang="en",
    max_id=str(maxId -1), result_type="recent", tweet_mode="extended")

    if not newTweets:
            print("Tweet habis")
            break

    for tweet in newTweets:
            d={}
            d["text"] = tweet.full_text.encode('utf-8')
            print (d["text"])
            listposts.append(d)

    tweetCount += len(newTweets)
    maxId = newTweets[-1].id
    print (listposts)

And below is the code that cleans the tweets:

def clean_tweet(tweet):
if type(tweet) == np.float:
    return ""
temp = tweet.lower()
temp = re.sub("'", "", temp) # to avoid removing contractions in english
temp = re.sub("@[A-Za-z0-9_]+","", temp)
temp = re.sub("#[A-Za-z0-9_]+","", temp)
temp = re.sub(r'http\S+', '', temp)
temp = re.sub('[()!?]', ' ', temp)
temp = re.sub('\[.*?\]',' ', temp)
temp = re.sub("[^a-z0-9]"," ", temp)
temp = temp.split()
temp = [w for w in temp if not w in stopwords]
temp = " ".join(word for word in temp)
return temp

tweets = [""]

results = [clean_tweet(tw) for tw in tweets]
results

Somehow I want to combine these two pieces of code so that the tweets go through the clean_tweet function before being written to the Excel file, and are therefore cleaned before the sentiment analysis.



Solution 1:[1]

Suppose you have the following function to clean tweets (this is your function with the indentation fixed, the deprecated np.float check replaced with a plain float check, and the regex patterns written as raw strings; note that stopwords must be defined elsewhere, e.g. loaded from NLTK):

def clean_tweet(tweet):
    if isinstance(tweet, float):  # e.g. a NaN cell read back from a spreadsheet
        return ""
    temp = tweet.lower()
    temp = re.sub("'", "", temp)  # drop apostrophes so contractions stay joined
    temp = re.sub(r"@[A-Za-z0-9_]+", "", temp)  # mentions
    temp = re.sub(r"#[A-Za-z0-9_]+", "", temp)  # hashtags
    temp = re.sub(r"http\S+", "", temp)         # URLs
    temp = re.sub(r"[()!?]", " ", temp)
    temp = re.sub(r"\[.*?\]", " ", temp)        # bracketed segments
    temp = re.sub(r"[^a-z0-9]", " ", temp)      # any remaining non-alphanumeric
    temp = temp.split()
    temp = [w for w in temp if w not in stopwords]
    return " ".join(temp)
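As a quick sanity check, here is a standalone sketch of the cleaning steps. The function is repeated so the snippet runs on its own, and the stopword set is a tiny illustrative one; in a real pipeline you would load a full list, e.g. nltk.corpus.stopwords.words("english"):

```python
import re

# Tiny illustrative stopword set (assumption for this demo only).
stopwords = {"out", "its"}

def clean_tweet(tweet):
    if isinstance(tweet, float):  # e.g. a NaN cell read back from a spreadsheet
        return ""
    temp = tweet.lower()
    temp = re.sub("'", "", temp)                # drop apostrophes so contractions stay joined
    temp = re.sub(r"@[A-Za-z0-9_]+", "", temp)  # mentions
    temp = re.sub(r"#[A-Za-z0-9_]+", "", temp)  # hashtags
    temp = re.sub(r"http\S+", "", temp)         # URLs
    temp = re.sub(r"[()!?]", " ", temp)
    temp = re.sub(r"\[.*?\]", " ", temp)        # bracketed segments
    temp = re.sub(r"[^a-z0-9]", " ", temp)      # any remaining non-alphanumeric
    words = [w for w in temp.split() if w not in stopwords]
    return " ".join(words)

print(clean_tweet("Check out #Bitcoin news!! @crypto_fan https://t.co/xyz (it's big)"))
# mentions, hashtags, URLs, punctuation and stopwords are all stripped
```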

Your other block of code has a few additional problems: listposts and maxId are never defined before they are used, the loop appends d (which no longer exists once you clean in place), and clean_tweet expects a string, so you should pass it tweet.full_text rather than the whole tweet object. With those fixed, you can call the function as follows:

tweetCount = 0
maxId = 0
listposts = []

while tweetCount < maxTweets:
    if maxId <= 0:
        newTweets = api.search_tweets(q=hashtag, count=tweetsPerQry, lang="en",
                                      result_type="recent", tweet_mode="extended")
    else:
        newTweets = api.search_tweets(q=hashtag, count=tweetsPerQry, lang="en",
                                      max_id=str(maxId - 1), result_type="recent",
                                      tweet_mode="extended")

    if not newTweets:
        print("Tweet habis")  # "out of tweets"
        break

    for tweet in newTweets:
        text = tweet.full_text
        print("Tweet before cleaning:", text, sep="\n")
        text = clean_tweet(text)
        print("Tweet after cleaning:", text, sep="\n")
        listposts.append({"text": text})

    tweetCount += len(newTweets)
    maxId = newTweets[-1].id
    # Instead of just printing listposts, this is where I would
    # write each entry of it to a line in a csv file
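For that last step, a minimal sketch using the standard-library csv module; the listposts entries here are placeholder dictionaries standing in for the ones collected in the loop above:

```python
import csv

# Placeholder data for illustration; in the real script, listposts is
# filled with {"text": ...} dictionaries inside the while-loop.
listposts = [{"text": "bitcoin rallies today"}, {"text": "market dips again"}]

# DictWriter maps each dictionary to one CSV row, keyed by fieldnames.
with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerows(listposts)
```

Excel opens CSV files directly, so this also satisfies the "put them in an excel file" part of the question without needing an extra library.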

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Elijah Cox