Low accuracy after testing hyperparameters

I am using the pre-trained VGG19 model with ImageNet weights to do transfer learning on 4 classes with Keras. However, I do not know whether there really is a difference between these 4 classes; that is what I would like to find out. The goal is to discover whether these classes make sense, or whether there is no real difference between the images in them.

These classes are made up of abstract paintings from the same individual.

I tried different models with different hyperparameters (Adam vs. SGD, learning rate, dropout, L2 regularization, size of the fully connected layers, batch size, number of unfrozen layers), and I also weighted the classes, as the data is slightly unbalanced:

import numpy as np
from sklearn.utils import class_weight
from tensorflow.keras import optimizers, regularizers
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.layers import (Activation, BatchNormalization, Dense,
                                     Dropout, GlobalMaxPooling2D)
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

batch_size = 32
unfreeze = 17    # index of the first layer to leave trainable
dropout = 0.2
fc = 256         # width of the fully connected layers
lr = 1e-4
l2_reg = 0.1
epochs = 50      # example value; I vary it between runs

train_datagen = ImageDataGenerator(
    preprocessing_function = preprocess_input,
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode='nearest'
)
test_datagen = ImageDataGenerator(preprocessing_function = preprocess_input)
train_generator = train_datagen.flow_from_directory(
        'C:/Users/train',
        target_size=(224, 224),
        batch_size=batch_size,
        class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(
        'C:/Users/test',
        target_size=(224, 224),
        batch_size=batch_size,
        class_mode='categorical')

base_model = VGG19(
    weights="imagenet",
    input_shape=(224, 224, 3),
    include_top=False,
)

# Use the output of VGG19's last pooling layer as input to the new classifier head.
last_layer = base_model.get_layer('block5_pool')
last_output = last_layer.output

x = GlobalMaxPooling2D()(last_output)   # pool the 7x7x512 maps down to a 512-d vector
x = Dense(fc)(x)
x = Activation('relu')(x)
x = BatchNormalization()(x)
x = Dropout(dropout)(x)
x = Dense(fc, activation='relu', kernel_regularizer=regularizers.l2(l2=l2_reg))(x)
x = Dense(4, activation='softmax')(x)   # one output per class

model = Model(base_model.input, x)
# Freeze all layers first, then unfreeze everything from layer index `unfreeze` on.
for layer in model.layers:
    layer.trainable = False
for layer in model.layers[unfreeze:]:
    layer.trainable = True

model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.SGD(learning_rate=lr),
              metrics=['accuracy'])

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
class_weights_dict = dict(enumerate(class_weights))

history = model.fit(train_generator,
                    epochs=epochs,
                    validation_data=validation_generator,
                    validation_steps=392 // batch_size,   # 392 validation images
                    steps_per_epoch=907 // batch_size,    # 907 training images
                    class_weight=class_weights_dict)      # apply the balanced class weights
plot_model_history(history)
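
plot_model_history is not a Keras function but a small helper of my own; a rough sketch of it (it just draws the accuracy and loss curves from the returned History object) would be:

import matplotlib.pyplot as plt

def plot_model_history(history):
    # Draw training/validation accuracy and loss over epochs.
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 4))
    ax_acc.plot(history.history['accuracy'], label='train')
    ax_acc.plot(history.history['val_accuracy'], label='validation')
    ax_acc.set_title('Accuracy')
    ax_acc.set_xlabel('epoch')
    ax_acc.legend()
    ax_loss.plot(history.history['loss'], label='train')
    ax_loss.plot(history.history['val_loss'], label='validation')
    ax_loss.set_title('Loss')
    ax_loss.set_xlabel('epoch')
    ax_loss.legend()
    plt.show()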

I also extracted features at every layer and fed them to an SVM (one SVM per layer); those SVMs reached about 40% accuracy, which is higher than this model (30 to 33%). So I may be wrong, but I think this model should be able to reach a higher accuracy.
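
For reference, the per-layer feature extraction looked roughly like the sketch below, shown here for a single layer (block5_pool); the LinearSVC settings are just an example, and in practice I looped over every layer:

from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Model that outputs the activations of one intermediate layer.
feature_model = Model(base_model.input,
                      base_model.get_layer('block5_pool').output)

def extract_features(generator, steps):
    feats, labels = [], []
    for _ in range(steps):
        x_batch, y_batch = next(generator)
        f = feature_model.predict(x_batch, verbose=0)
        feats.append(f.reshape(len(f), -1))    # flatten the feature maps
        labels.append(y_batch.argmax(axis=1))  # one-hot -> class index
    return np.concatenate(feats), np.concatenate(labels)

X_train, y_train = extract_features(train_generator, 907 // batch_size)
X_val, y_val = extract_features(validation_generator, 392 // batch_size)

svm = LinearSVC(max_iter=10000)
svm.fit(X_train, y_train)
print(accuracy_score(y_val, svm.predict(X_val)))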

I have a few questions about my model.

First, is my code correct, or am I doing something wrong?

If the validation accuracy on a 4-class classification task is around 30% (with the data balanced or weighted, so chance level is 25%), how likely is it that other hyperparameters could push it significantly higher?
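
(For context on those numbers, chance level and the majority-class baseline can be read off the generator's labels:)

counts = np.bincount(validation_generator.classes)
print(counts / counts.sum())        # class proportions
print(counts.max() / counts.sum())  # accuracy of always predicting the majority class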

What else can I try to get a better accuracy?

When and how can I conclude that these classes do not make sense?



Sources

Source: Stack Overflow, licensed under CC BY-SA 3.0 (attribution per Stack Overflow's requirements).