Multi-class text classification with a 1D CNN

I'm working on an authorship-detection-from-text task. My data is a data frame of character n-gram features that I built with TfidfVectorizer from about 110k texts by 147 different authors. I encode the author-name strings into integers using sklearn's LabelEncoder.
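For reference, a minimal sketch of this preprocessing, assuming a character-level TfidfVectorizer capped at 1000 features (to match the shapes below); texts, author_names, and the ngram_range are placeholders, not the exact setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# character n-gram TF-IDF features; max_features=1000 matches the (·, 1000) shapes below
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), max_features=1000)
features = vectorizer.fit_transform(texts).toarray()

# author-name strings -> integer class ids 0..146
encoder = LabelEncoder()
labels = encoder.fit_transform(author_names)

Data_train, Data_test, Labels_train, Labels_test = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=42)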

  • Data is split into: train (99817, 1000), test (11091, 1000), train labels (99817,), test labels (11091,)
  • With the model below, my best results come after 7-8 epochs: loss: 0.7225 - accuracy: 0.8070 - val_loss: 1.3828 - val_accuracy: 0.6777; after that, the model starts to overfit.

Model:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# patience is a whole number of epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=2)

model = Sequential([
    Dense(300, activation="relu", input_shape=(Data_train.shape[-1],)),
    Dense(750, activation="relu"),
    BatchNormalization(),
    Dropout(0.5),
    Dense(147, activation="softmax"),
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])
history = model.fit(Data_train,
                    Labels_train,
                    epochs=10,
                    shuffle=True,
                    validation_data=(Data_test, Labels_test),  # the 11091-sample held-out split
                    callbacks=[early_stopping])

I want to try to solve this task with a CNN. I found a similar example for a task like mine that uses a 1D network and tried to adapt it:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Embedding, Dropout, Conv1D, MaxPooling1D,
                                     BatchNormalization, Flatten, Dense)
from tensorflow.keras.callbacks import EarlyStopping

vocab_size = 1000
maxlen = 1000
batch_size = 32
embedding_dims = 10
filters = 16
kernel_size = 3
hidden_dims = 250
epochs = 10
early_stopping = EarlyStopping(patience=1)  # patience is a whole number of epochs

model = Sequential([
    # Embedding looks up dense vectors for integer token ids in [0, vocab_size)
    Embedding(vocab_size, embedding_dims, input_length=maxlen),
    Dropout(0.5),
    Conv1D(filters, kernel_size, padding='valid', activation='relu'),
    MaxPooling1D(),
    BatchNormalization(),
    Conv1D(filters, kernel_size, padding='valid', activation='relu'),
    MaxPooling1D(),
    Flatten(),
    Dense(hidden_dims, activation='relu'),
    Dropout(0.5),
    Dense(147, activation='softmax'),
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])
model.fit(Data_train, Labels_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2,
          callbacks=[early_stopping])

I only manage to get these results: from the first epoch to the fifth, accuracy and val_accuracy climb to ~0.07 and then plateau. After 5 epochs this is what I got:

  • loss: 4.5621 - accuracy: 0.0701 - val_loss: 4.5597 - val_accuracy: 0.0702

Could someone help me improve these models, especially the CNN? Any suggestions are welcome; if I need to provide anything more, please let me know. Thank you.



Solution 1:[1]

I have managed to solve my issue. An Embedding layer expects integer token indices, whereas my TF-IDF features are continuous values, so instead of using an embedding layer I transformed my data and labels by adding an extra dimension and passed the input shape directly to the first convolutional layer:

import tensorflow as tf

# add a trailing channels axis: (N, 1000) -> (N, 1000, 1)
Data_train_tmp = tf.expand_dims(Data_train, -1)
Labels_train_tmp = tf.expand_dims(Labels_train, 1)

# first layer of the network, fed the TF-IDF vectors directly
Conv1D(62, 5, activation='relu', input_shape=(Data_train.shape[1], 1)),
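The answer only shows the first layer, so here is a minimal end-to-end sketch of how the pieces could fit together; everything after the Conv1D layer is assumed, loosely mirroring the question's CNN:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, BatchNormalization,
                                     Flatten, Dense, Dropout)

Data_train_tmp = tf.expand_dims(Data_train, -1)     # (N, 1000, 1)
Labels_train_tmp = tf.expand_dims(Labels_train, 1)  # (N, 1)

model = Sequential([
    # each TF-IDF vector enters as a length-1000 sequence with one channel
    Conv1D(62, 5, activation='relu', input_shape=(Data_train.shape[1], 1)),
    MaxPooling1D(),
    BatchNormalization(),
    Flatten(),
    Dense(250, activation='relu'),
    Dropout(0.5),
    Dense(147, activation='softmax'),
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])
model.fit(Data_train_tmp, Labels_train_tmp,
          batch_size=32, epochs=10, validation_split=0.2)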

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: hixi22745