How to encode categorical data for a multiclass classification in TensorFlow to avoid Shape mismatch
I have a dataframe with a column text for the article body and a column topic for the topic labels. Each entry in topic is a list of labels.
>>> df.topic.head(5)
0 [ECONOMIC PERFORMANCE, ECONOMICS, EQUITY MARKE...
1 [CAPACITY/FACILITIES, CORPORATE/INDUSTRIAL]
2 [PERFORMANCE, ACCOUNTS/EARNINGS, CORPORATE/IND...
3 [PERFORMANCE, ACCOUNTS/EARNINGS, CORPORATE/IND...
4 [STRATEGY/PLANS, NEW PRODUCTS/SERVICES, CORPOR...
Name: topic, dtype: object
I first exploded topic using df = df.explode('topic').reset_index(drop=True) in order to put one label per row.
>>> df.head(5)
text topic
0 Emerging evidence that Mexico economy was back... ECONOMIC PERFORMANCE
1 Emerging evidence that Mexico economy was back... ECONOMICS
2 Emerging evidence that Mexico economy was back... EQUITY MARKETS
3 Emerging evidence that Mexico economy was back... BOND MARKETS
4 Emerging evidence that Mexico economy was back... MARKETS
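As a rough illustration of what explode does here, the following plain-Python sketch turns each list of labels into one row per label, repeating the text. It does not use pandas; the two sample rows and the helper name explode_rows are made up for the example.

```python
# Hypothetical sample rows: (text, list of topic labels).
rows = [
    ("Article A", ["ECONOMICS", "MARKETS"]),
    ("Article B", ["PERFORMANCE"]),
]


def explode_rows(rows):
    """Return one (text, label) pair per label, repeating the text."""
    return [(text, label) for text, labels in rows for label in labels]


exploded = explode_rows(rows)
for text, label in exploded:
    print(text, label)
```

Note that the same text now appears once per label, so identical inputs carry different targets.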
Currently, I am encoding topic using
from sklearn.preprocessing import LabelEncoder
# Encode labels
le = LabelEncoder()
df['topic_encoded'] = le.fit_transform(df['topic'])
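For reference, LabelEncoder assigns integer ids by sorting the distinct labels. The sketch below reproduces that mapping in plain Python on a made-up list of topics; the names topics, classes, and to_id are illustrative only.

```python
# Hypothetical topic column after explode.
topics = ["ECONOMICS", "MARKETS", "ECONOMICS", "BOND MARKETS"]

# Sorted distinct labels; the position of each label is its integer id.
classes = sorted(set(topics))
to_id = {label: i for i, label in enumerate(classes)}

encoded = [to_id[t] for t in topics]
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
```

The ids run from 0 to len(classes) - 1, which is the range a sparse loss expects when the model is built with num_labels equal to len(classes).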
Then I am using scikit-learn's train_test_split, tokenizing the data using DistilBERT, and converting it to a TensorFlow dataset:
from sklearn.model_selection import train_test_split
# Split dataset into train, test, val (about 72%, 15%, 13%)
train, test = train_test_split(df, test_size=0.15)
train, val = train_test_split(train, test_size=0.15)
# Create new index
train_idx = [i for i in range(len(train.index))]
test_idx = [i for i in range(len(test.index))]
val_idx = [i for i in range(len(val.index))]
# Convert to numpy
x_train = train['text'].values[train_idx]
x_test = test['text'].values[test_idx]
x_val = val['text'].values[val_idx]
y_train = train['topic_encoded'].values[train_idx]
y_test = test['topic_encoded'].values[test_idx]
y_val = val['topic_encoded'].values[val_idx]
# Tokenize datasets
tr_tok = tokenizer(list(x_train), return_tensors='tf', truncation=True, padding=True, max_length=128)
val_tok = tokenizer(list(x_val), return_tensors='tf', truncation=True, padding=True, max_length=128)
test_tok = tokenizer(list(x_test), return_tensors='tf', truncation=True, padding=True, max_length=128)
# Convert dfs to tds
train_ds = tf.data.Dataset.from_tensor_slices((dict(tr_tok), y_train))
val_ds = tf.data.Dataset.from_tensor_slices((dict(val_tok), y_val))
test_ds = tf.data.Dataset.from_tensor_slices((dict(test_tok), y_test))
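A side note on the split sizes: applying test_size=0.15 twice yields roughly 72/13/15 (train/val/test) rather than 70/15/15, because the second split takes 15% of the remaining 85%. The arithmetic below is a small sketch of that; the variable names are illustrative.

```python
test_frac = 0.15
# First split: 15% goes to test, 85% remains.
remaining = 1.0 - test_frac

# Second split as written: 15% of the remainder goes to validation.
val_frac = 0.15 * remaining
train_frac = remaining - val_frac
print(round(train_frac, 4), round(val_frac, 4), test_frac)

# To get 15% of the full data as validation, the second test_size
# has to be expressed relative to the remainder.
second_test_size = 0.15 / remaining
print(round(second_test_size, 4))
```

Passing a value close to second_test_size to the second train_test_split call would bring the split nearer to 70/15/15.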
My test model is pretty simple:
# Set up model
# Recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# 1 epoch for illustration, though multiple epochs might be better as long as we don't overfit
number_of_epochs = 1
# Model initialization
model = TFDistilBertForSequenceClassification.from_pretrained(
'distilbert-base-uncased',
num_labels=df['topic_encoded'].unique().shape[0]
)
print(df['topic_encoded'].unique().shape[0])
# Optimizer Adam recommended
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# We do not have one-hot vectors, so we can use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
bert_history = model.fit(
train_ds,
epochs=number_of_epochs,
validation_data=test_ds)
However, I get ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (128, 91)). It is raised on the line
bert_history = model.fit(
train_ds,
epochs=number_of_epochs,
validation_data=test_ds)
I assume this has to do with the way the data is encoded, but I am not sure how to fix this issue. Should this be one-hot encoded instead of label encoded? Any help would be great!
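On the one-hot question: with integer class ids, sparse categorical cross-entropy computes the same value as categorical cross-entropy on the matching one-hot vector, so the encoding itself should not need to change. A minimal plain-Python sketch with made-up logits follows; the function names are illustrative and not a library API.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def sparse_ce(label_id, logits):
    """Cross-entropy for an integer class id."""
    return -log_softmax(logits)[label_id]


def one_hot_ce(one_hot, logits):
    """Cross-entropy for a one-hot target vector."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))


logits = [2.0, 0.5, -1.0]   # made-up scores for 3 classes
label_id = 1
one_hot = [0.0, 1.0, 0.0]   # same label, one-hot encoded

print(sparse_ce(label_id, logits), one_hot_ce(one_hot, logits))
```

Both calls print the same loss, which is why label-encoded targets work with the sparse variant of the loss.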
Solution 1:[1]
num_labels=df['topic_encoded'].unique().shape[0]
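The line above is not the cause: the number of unique encoded labels is the right value for num_labels, and it is where the 91 in the logits shape comes from. The error is about the other dimension. Logits are the model's raw prediction scores, one row per example and one column per class, and the labels must line up with them row by row: the shape of the labels (received (1,)) has to equal the shape of the logits without the last dimension (received (128, 91), so (128,)). Here there is one label against 128 rows of logits. Since 128 is also the max_length used for tokenizing, the datasets are most likely reaching model.fit unbatched, so the 128 token positions of a single example are treated as 128 examples. Batching train_ds, val_ds and test_ds before calling model.fit should make the label and logits shapes agree.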
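The shape rule quoted in the error can be checked by hand. The helper below is a hypothetical stand-in for that check, not TensorFlow's own code. It compares the label shape with the logits shape minus its last dimension, using the shapes from the error message and, as an assumed example, a batch of 16.

```python
def shapes_compatible(labels_shape, logits_shape):
    """Labels must match the logits shape with the last (class) axis dropped."""
    return tuple(labels_shape) == tuple(logits_shape[:-1])


# Shapes reported in the error: one label against 128 rows of 91 scores.
print(shapes_compatible((1,), (128, 91)))

# Assumed shapes after batching 16 examples: 16 labels, 16 rows of 91 scores.
print(shapes_compatible((16,), (16, 91)))
```

With tf.data, shapes of the second kind come from calling .batch(16) on each dataset before model.fit; 16 is an arbitrary batch size.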
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
