My sentiment analysis model doesn't remember the training set's sentiment
So I have been trying to build a sentiment analysis tool. I will paste my code with comments so you can see my thought process, and then highlight where the problem happens.
import os
import re
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Embedding, Dropout
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# opening the data
data = pd.read_csv("tweetPNN.txt", sep="\t", names=['text','sentiment'], encoding='utf-8')
# some data cleaning
data['text'] = data['text'].apply(lambda x: x.lower())  # lowercase the text
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zšžčćđA-ZŠŽČĆĐ0-9\s]', '', x))  # keep only letters (incl. šžčćđ), digits and whitespace
print(data['text'].head())
tokenizer = Tokenizer(num_words=5000, split=" ")
tokenizer.fit_on_texts(data['text'].values)
# padding our text vector
X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X) # padding our text vector so they all have the same length
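# a tiny standalone sanity check (made-up sentences and a throwaway tokenizer,
# just to illustrate what texts_to_sequences and pad_sequences produce):
toy_tok = Tokenizer(num_words=50)
toy_tok.fit_on_texts(["good movie", "really good movie", "bad"])
toy_seqs = toy_tok.texts_to_sequences(["good movie", "really good movie", "bad"])
print(toy_seqs)                 # lists of word indices, e.g. [[1, 2], [3, 1, 2], [4]]
print(pad_sequences(toy_seqs))  # each row zero-padded on the left to the longest length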
# making sure my GPU doesn't run out of memory
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'
# defining the model
model = Sequential()
model.add(Embedding(5000, 256, input_length=X.shape[1]))
model.add(Dropout(0.2))
model.add(LSTM(180, return_sequences=True, dropout=0.3, recurrent_dropout=0.2))
model.add(LSTM(180, dropout=0.3, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# checkup
model.summary()
# transferring sentiment to one-hot vectors, e.g. pos -> 1 0 0, neut -> 0 1 0, neg -> 0 0 1
y = pd.get_dummies(data['sentiment']).values
#checking if it worked
for i in range(5):
    print(data['sentiment'][i], y[i])
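# also printing the dummy column order, since (as far as I understand)
# get_dummies orders the columns by label name rather than pos/neut/neg
print(pd.get_dummies(data['sentiment']).columns)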
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#training the model
batch_size = 28
epochs = 6
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1)
model.save('sentitest3.h5')
predictions = model.predict(X_test)
#checking the predictions
for i in range(200, 250):
    print(data['text'][i], predictions[i], y_test[i])
So the problem appears when I check the predictions. The loop is supposed to print the model's prediction next to the real sentiment value of the sentence (a tweet, actually). However, after double-checking sentences that have the same sentiment, it keeps returning different "real" values. For example:
[0.18112019 0.47002175 0.3488581 ] [0 1 0]
[0.8804077 0.08246485 0.0371275 ] [1 0 0]
Even though these two sentences have different real values according to the output, in the data they both have neutral sentiment. Why is this happening? Also, is there an elegant way to compare against the test data, so I can measure precision, recall and F-measure?
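The closest I have come up with myself is collapsing both the softmax outputs and the one-hot labels back to class indices and handing them to scikit-learn, but I am not sure this is the right way (just a sketch, I have not verified it against my data):

from sklearn.metrics import classification_report
# turn softmax probabilities and one-hot labels back into class indices,
# then let sklearn compute per-class precision, recall and F1
y_pred_classes = np.argmax(predictions, axis=1)
y_true_classes = np.argmax(y_test, axis=1)
print(classification_report(y_true_classes, y_pred_classes))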
Thanks,
