How to encode categorical data for multiclass classification in TensorFlow to avoid a shape mismatch

I have a dataframe with a text column holding the article body and a topic column holding topic labels. Each entry in topic is a list of labels.

>>> df.topic.head(5)
0    [ECONOMIC PERFORMANCE, ECONOMICS, EQUITY MARKE...
1          [CAPACITY/FACILITIES, CORPORATE/INDUSTRIAL]
2    [PERFORMANCE, ACCOUNTS/EARNINGS, CORPORATE/IND...
3    [PERFORMANCE, ACCOUNTS/EARNINGS, CORPORATE/IND...
4    [STRATEGY/PLANS, NEW PRODUCTS/SERVICES, CORPOR...
Name: topic, dtype: object

I first exploded topic using df = df.explode('topic').reset_index(drop=True) in order to put one label per row.

>>> df.head(5)
                                                text                 topic
0  Emerging evidence that Mexico economy was back...  ECONOMIC PERFORMANCE
1  Emerging evidence that Mexico economy was back...             ECONOMICS
2  Emerging evidence that Mexico economy was back...        EQUITY MARKETS
3  Emerging evidence that Mexico economy was back...          BOND MARKETS
4  Emerging evidence that Mexico economy was back...               MARKETS

Currently, I am encoding topic using

from sklearn.preprocessing import LabelEncoder

# Encode labels
le = LabelEncoder()
df['topic_encoded'] = le.fit_transform(df['topic'])
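
To make the encoding step concrete: LabelEncoder assigns each distinct label an integer index based on the sorted order of the classes. The following is a minimal pure-Python sketch of that mapping, not scikit-learn's implementation. The sample labels are borrowed from the output above, and `encode_labels` is a hypothetical helper introduced only for illustration.

```python
def encode_labels(labels):
    """Map each distinct label to an integer index, using sorted class order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


# Sample labels taken from the exploded dataframe shown earlier.
topics = ["ECONOMICS", "MARKETS", "BOND MARKETS", "ECONOMICS"]
encoded, classes = encode_labels(topics)

print(classes)  # distinct labels in sorted order
print(encoded)  # one integer per row, same length as the input
```

The encoded list keeps one entry per row, while `classes` holds only the distinct labels, so `len(classes)` is the number of distinct classes.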

Then I am using scikit-learn's train_test_split, tokenizing the data with DistilBERT, and converting it to a TensorFlow dataset:

# Split dataset into train, test, val (roughly 72/13/15 with these test_size values)
train, test = train_test_split(df, test_size=0.15)
train, val = train_test_split(train, test_size=0.15)
# Create new index
train_idx = list(range(len(train)))
test_idx = list(range(len(test)))
val_idx = list(range(len(val)))

# Convert to numpy
x_train = train['text'].values[train_idx]
x_test = test['text'].values[test_idx]
x_val = val['text'].values[val_idx]

y_train = train['topic_encoded'].values[train_idx]
y_test = test['topic_encoded'].values[test_idx]
y_val = val['topic_encoded'].values[val_idx]

# Tokenize datasets
tr_tok = tokenizer(list(x_train), return_tensors='tf', truncation=True, padding=True, max_length=128)
val_tok = tokenizer(list(x_val), return_tensors='tf', truncation=True, padding=True, max_length=128)
test_tok = tokenizer(list(x_test), return_tensors='tf', truncation=True, padding=True, max_length=128)

# Convert dfs to tds
train_ds = tf.data.Dataset.from_tensor_slices((dict(tr_tok), y_train))
val_ds = tf.data.Dataset.from_tensor_slices((dict(val_tok), y_val))
test_ds = tf.data.Dataset.from_tensor_slices((dict(test_tok), y_test))
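
A side note on the split proportions: chaining two train_test_split calls with test_size=0.15 holds out 15% of all rows and then 15% of the remaining 85%, which works out to about 72.25% / 12.75% / 15%. The arithmetic can be sketched as follows; `chained_split_fractions` is a hypothetical helper used only to show the calculation.

```python
def chained_split_fractions(test_size, val_size):
    """Return (train, val, test) fractions for two chained splits.

    The first split holds out `test_size` of all rows; the second holds out
    `val_size` of the rows that remain after the first split.
    """
    remaining = 1.0 - test_size
    val = remaining * val_size
    return remaining - val, val, test_size


# Both splits use 0.15, as in the code above.
print(chained_split_fractions(0.15, 0.15))

# To hold out 15% of the *original* rows for validation, scale the second
# test_size by the fraction of rows that is left after the first split.
target_val_size = 0.15 / (1.0 - 0.15)
print(chained_split_fractions(0.15, target_val_size))
```

Passing `target_val_size` as the second test_size should give a split close to 70/15/15.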

My test model is pretty simple:

# Set up model
# Recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# 1 epoch for illustration, though multiple epochs might be better as long as we don't overfit
number_of_epochs = 1

# Model initialization
model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=df['topic_encoded'].unique().shape[0]
)

print(df['topic_encoded'].unique().shape[0])

# Optimizer Adam recommended
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)

# We do not have one-hot vectors, so we can use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])


bert_history = model.fit(
    train_ds,
    epochs=number_of_epochs,
    validation_data=test_ds)

Except I get a ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (128, 91)). on the line

bert_history = model.fit(
    train_ds,
    epochs=number_of_epochs,
    validation_data=test_ds)

I assume this has to do with the way the data is encoded, but I am not sure how to fix this issue. Should this be one-hot encoded instead of label encoded? Any help would be great!
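
On the one-hot question: integer labels with a sparse categorical cross-entropy and one-hot labels with a regular categorical cross-entropy compute the same quantity, so changing the encoding alone would not be expected to change the shapes the loss complains about. The sketch below shows this in pure Python; the helper functions and the logits are made up for illustration and are not the Keras implementations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(logits, label):
    """Cross-entropy for a target given as an integer class index."""
    return -math.log(softmax(logits)[label])


def one_hot_cross_entropy(logits, one_hot):
    """Cross-entropy for a target given as a one-hot vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


logits = [2.0, 0.5, -1.0]   # made-up scores for 3 classes
label = 1                   # integer-encoded target
one_hot = [0.0, 1.0, 0.0]   # the same target, one-hot encoded

sparse_loss = sparse_cross_entropy(logits, label)
dense_loss = one_hot_cross_entropy(logits, one_hot)
print(sparse_loss, dense_loss)  # the two values agree
```

Either way, the loss needs one row of logits per label, which is the relationship the error message is about.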



Solution 1:[1]

num_labels=df['topic_encoded'].unique().shape[0]

Don't take unique() here; doing so reduces the size of your labels. The ValueError tells you exactly this: "Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits". Make sure you understand what logits are: they are the model's raw prediction scores computed from your dataset's features, and they must line up with your labels vector. Together they resemble a table with one column of labels and several columns of predictions, matched row by row, not by unique value.
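
The shapes in the error message are also consistent with a dataset that was never batched. Dataset.from_tensor_slices yields one example at a time, so each element carries input_ids of shape (128,). That matches the 128 in the reported logits shape (the tokenizer's max_length), while 91 would be the number of labels. Below is a minimal sketch of the difference batching makes. It uses dummy arrays in place of the tokenizer output, and the batch size of 2 is an arbitrary assumption.

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins for the tokenizer output and encoded labels (assumed shapes).
features = {"input_ids": np.zeros((6, 128), dtype=np.int32)}
labels = np.arange(6, dtype=np.int32)

unbatched = tf.data.Dataset.from_tensor_slices((features, labels))
batched = unbatched.batch(2)  # batch size chosen only for illustration

# Unbatched elements have no leading batch dimension.
print(unbatched.element_spec)
# Batched elements gain one: (None, 128) for inputs and (None,) for labels.
print(batched.element_spec)
```

In the question's code this would correspond to calling .batch(...) on train_ds, val_ds and test_ds before passing them to model.fit. Whether that fully resolves the error there has not been tested here.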

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1