Imbalanced Dataset Using Keras
I am building a classification ANN with Python and the Keras library. I am training the NN on an imbalanced dataset with 3 different classes. Class 1 is about 7.5 times as prevalent as Classes 2 and 3. As a remedy, I took the advice of this Stack Overflow answer and set my class weights as follows:
class_weight = {0: 1,
                1: 6.5,
                2: 7.5}
However, here is the problem: The ANN is predicting the 3 classes at equal rates!
This is not useful because the dataset is imbalanced, and predicting the outcomes as each having a 33% chance is inaccurate.
Here is the question: How do I deal with an imbalanced dataset so that the ANN does not predict Class 1 every time, but also so that the ANN does not predict the classes with equal probability?
Here is the code I am working with:
class_weight = {0: 1,
                1: 6.5,
                2: 7.5}

# Making the ANN
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

classifier = Sequential()

# Adding the input layer and the first hidden layer with dropout
classifier.add(Dense(activation = 'relu',
                     input_dim = 5,
                     units = 3,
                     kernel_initializer = 'uniform'))
# Randomly drops 10% of the neurons in the layer
classifier.add(Dropout(rate = 0.1))

# Adding the second hidden layer
classifier.add(Dense(activation = 'relu',
                     units = 3,
                     kernel_initializer = 'uniform'))
# Randomly drops 10% of the neurons in the layer
classifier.add(Dropout(rate = 0.1))

# Adding the output layer
classifier.add(Dense(activation = 'sigmoid',
                     units = 2,
                     kernel_initializer = 'uniform'))

# Compiling the ANN
classifier.compile(optimizer = 'adam',
                   loss = 'binary_crossentropy',
                   metrics = ['accuracy'])

# Fitting the ANN to the training set
classifier.fit(X_train, y_train, batch_size = 100, epochs = 100, class_weight = class_weight)
Solution 1:[1]
The most evident problem that I see with your model is that it is not properly structured for classification. If your samples can belong to only one class at a time, then you should not overlook this fact by having a sigmoid activation as your last layer.
Ideally, the last layer of a classifier should output the probability of the sample belonging to each class, i.e. (in your case) an array [a, b, c] where a + b + c == 1.
If you use a sigmoid output, then the output [1, 1, 1] is a possible one, although it is not what you are after. This is also the reason why your model is not generalizing properly: given that you're not specifically training it to prefer "unbalanced" outputs (like [1, 0, 0]), it will default to predicting the average values that it sees during training, accounting for the reweighting.
Try changing the activation of your last layer to 'softmax', the loss to 'categorical_crossentropy', and (since you have three classes) the number of output units to 3:
# Adding the output layer: one unit per class, softmax so the outputs sum to 1
classifier.add(Dense(activation='softmax',
                     units=3,
                     kernel_initializer='uniform'))

# Compiling the ANN
classifier.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
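Note that categorical_crossentropy expects one-hot encoded targets; if your y_train currently holds the integer labels 0, 1 and 2, a minimal conversion sketch would be:
# categorical_crossentropy needs one-hot targets; this assumes y_train
# contains the integer labels 0, 1 and 2 (3 classes).
from keras.utils import to_categorical

y_train_onehot = to_categorical(y_train, num_classes=3)
classifier.fit(X_train, y_train_onehot,
               batch_size=100, epochs=100, class_weight=class_weight)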
If this doesn't work, see my other comment and get back to me with that info, but I'm pretty confident that this is the main problem.
Cheers
Solution 2:[2]
Imbalanced datasets (where classes are uneven or unequally distributed) are a prevalent problem in classification. For example, one class label has a very high number of observations while another has very few. Significant causes of data imbalance include faulty data collection and domain peculiarity, where some domains naturally produce imbalanced data.
Imbalanced datasets can create many problems in classification, hence the need to rebalance the data so that models are robust and perform well.
Here are several methods to bring balance to imbalanced datasets (rough code sketches for some of them follow the list):
Undersampling – resamples (drops) majority class points so that their number matches the minority class. This brings equilibrium between the majority and minority classes so that the classifier gives equal importance to both. Note, however, that undersampling discards data and may therefore lose information.
Oversampling – Also known as upsampling, oversampling resamples the minority class until it equals the number of majority class points. It replicates observations from the minority class to balance the dataset.
Synthetic Minority Oversampling Technique – As the name suggests, the SMOTE technique uses oversampling to create artificial data points for minority classes. It creates new synthetic instances by interpolating between existing minority class points.
Searching an optimal value from a grid – This technique involves predicting class probabilities and then searching for the optimum threshold (or per-class weights) that maps those probabilities to the correct class label.
Using the BalancedBaggingClassifier – The BalancedBaggingClassifier (from the imbalanced-learn package) resamples each bootstrap subset before training each estimator of the ensemble, so every estimator sees a balanced sample.
Use different algorithms – Some algorithms are not effective on imbalanced datasets, and it is sometimes worth trying different ones to stand a better chance of improving performance. For instance, you can employ regularization or penalized models that punish wrong predictions on the minority class more heavily.
The effects of imbalanced datasets can be significant. Hopefully, one of the approaches above can help you get in the right direction.
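As a rough sketch of the resampling options above (undersampling, oversampling, SMOTE), the imbalanced-learn package provides ready-made samplers; the snippet below assumes X_train and y_train are the training arrays from the question and that imbalanced-learn is installed:
# Resampling sketches using imbalanced-learn (pip install imbalanced-learn).
# Assumes X_train is a 2-D feature array and y_train holds integer class labels.
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Undersampling: drop majority class points until all classes are equal in size
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# Oversampling: replicate minority class points until all classes are equal in size
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# SMOTE: synthesize new minority class points by interpolating between neighbours
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)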
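For the grid-search idea above, one simple illustrative approach is to rescale the predicted class probabilities with per-class weights and keep the weights that maximize macro-F1 on a held-out validation set; X_val and y_val are assumed to exist here, and the classifier is assumed to end in a softmax layer:
# Grid search over per-class probability weights (illustrative sketch only).
# Assumes X_val / y_val are a held-out validation set with integer labels 0-2.
import itertools
import numpy as np
from sklearn.metrics import f1_score

probs = classifier.predict(X_val)              # shape: (n_samples, 3)
grid = [0.5, 0.75, 1.0, 1.5, 2.0]

best_w, best_f1 = None, -1.0
for w in itertools.product(grid, repeat=3):    # one weight per class
    preds = np.argmax(probs * np.array(w), axis=1)
    score = f1_score(y_val, preds, average='macro')
    if score > best_f1:
        best_w, best_f1 = w, score

print('best per-class weights:', best_w, 'macro F1:', best_f1)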
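For the BalancedBaggingClassifier mentioned above, a minimal sketch using imbalanced-learn with its default decision-tree base estimator might look like this (X_test is assumed to exist alongside X_train):
# BalancedBaggingClassifier resamples each bootstrap sample before fitting,
# so every base estimator trains on balanced data.
from imblearn.ensemble import BalancedBaggingClassifier

bbc = BalancedBaggingClassifier(n_estimators=10,
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=42)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)   # X_test assumed to exist alongside X_train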
To test which approach works best for you, I’d suggest using deepchecks, an open-source Python package for quickly validating data and models.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Daniele Grattarola |
| Solution 2 | Buch133 |
