Using BERT Embeddings for text classification
I am trying to automatically detect whether a text was written by a machine or by a human.
My first approach was using TF-IDF features for a logistic regression classifier, which gave me an accuracy of around 60%.
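Roughly, that baseline looked like the following (a simplified sketch with default scikit-learn settings, not my exact parameters):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# TF-IDF features per document, with a plain logistic regression on top
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(training_set['document'])
y = training_set['label']

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = LogisticRegression()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)  # around 0.60 with this approach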
Now I'm trying to obtain the features from BERT instead, as it was suggested that they can be a good predictor.
I did so without fine-tuning the model, but the results of the logistic regression on top of the BERT features are very bad: around 45%, i.e. worse than a dummy classifier. This is my code so far (I am doing it in batches, since I get a GPU OOM error if I try to pass everything at once):
batch_size = 200
k = (training_set['document'].count() + 1)//batch_size + 1
results = []
for i in range(1, k):
    print("Batch " + str(i) + " out of " + str(k - 1))
    current_batch = training_set.iloc[(i - 1) * batch_size:i * batch_size]
    # Tokenize each document, adding special tokens and truncating to BERT's 512-token limit
    tokenized = current_batch['document'].apply(
        lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True))
    # Padding: find the largest document in the batch and fill all the smaller
    # documents with zeros so the batch forms a rectangular matrix
    largest_size = 0
    for document in tokenized.values:
        if len(document) > largest_size:
            largest_size = len(document)
    padded = np.array([document + [0] * (largest_size - len(document)) for document in tokenized.values])
    # Attention mask: 1 for real tokens, 0 where padding was added, so the model ignores the padding
    attention_mask = np.where(padded != 0, 1, 0)
    # Move the inputs to the GPU
    input_ids = torch.tensor(padded).to(device)
    attention_mask = torch.tensor(attention_mask).to(device)
    # Run the forward pass without gradients
    with torch.no_grad():
        last_hidden_states = model(input_ids, attention_mask=attention_mask)
    # Keep the [CLS] token embedding of each document as its feature vector
    for element in last_hidden_states[0][:, 0, :].cpu().numpy():
        results.append(element)
    # Clean the CUDA cache between batches
    gc.collect()
    torch.cuda.empty_cache()
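For reference, the tokenizer, model, and device used above come from the Hugging Face transformers library and are set up roughly like this (the checkpoint name below is just a placeholder for a standard pre-trained BERT; no fine-tuning is done):

import gc
import numpy as np
import torch
from transformers import BertModel, BertTokenizer

# Placeholder checkpoint; any pre-trained BERT checkpoint would be loaded the same way
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.eval()  # inference only, no gradient updates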
My training set is a pandas DataFrame whose columns are [id, document, label], and my code for the logistic regression classifier is very straightforward:
features = results
labels = training_set['label']
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
lr_clf.score(test_features, test_labels)