Using BERT Embeddings for text classification

I am trying to automatically detect whether a text is written by a Machine or a Human.

My first approach was to use TF-IDF to build features for a logistic regression classifier, which got me an accuracy of around 60%.
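
For reference, that baseline looked roughly like this (a minimal sketch assuming scikit-learn; the vectorizer settings are illustrative, not my exact configuration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Build TF-IDF features over the raw documents
vectorizer = TfidfVectorizer()
tfidf_features = vectorizer.fit_transform(training_set['document'])
tfidf_labels = training_set['label']

X_train, X_test, y_train, y_test = train_test_split(tfidf_features, tfidf_labels)
baseline_clf = LogisticRegression(max_iter=1000)
baseline_clf.fit(X_train, y_train)
baseline_clf.score(X_test, y_test)  # around 0.60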

Now I'm trying to obtain the features from BERT instead, since it was suggested that they can be a good predictor.

I did so without fine-tuning the model, but the results of the logistic regression on the BERT features are very bad - around 45% accuracy, i.e. worse than a dummy classifier. This is my code so far (I process the data in batches, since I get a GPU out-of-memory error if I try to pass everything at once).
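
The tokenizer, model, and device it relies on are initialized beforehand, along these lines (a minimal sketch - I'm assuming the Hugging Face transformers library and the bert-base-uncased checkpoint here):

import gc

import numpy as np
import torch
from transformers import BertModel, BertTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(device)
# Inference only - no fine-tuning, so disable dropout
model.eval()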

batch_size = 200
# Number of batches, rounded up so the final partial batch is not dropped
k = (training_set['document'].count() + batch_size - 1)//batch_size + 1
results = []
for i in range(1,k):
  print("Batch "+str(i)+" out of "+str(k-1))
  current_batch = training_set.iloc[(i-1)*batch_size:i*batch_size]
  # Tokenize each document, adding the special [CLS]/[SEP] tokens and truncating to BERT's 512-token limit
  tokenized = current_batch['document'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True)))
  # Pad every document with zeros up to the length of the longest one in the batch,
  # so that the batch forms a rectangular matrix
  largest_size = max(len(document) for document in tokenized.values)
  padded = np.array([document + [0]*(largest_size-len(document)) for document in tokenized.values])

  # Attention mask: 1 for real tokens, 0 for padding positions
  attention_mask = np.where(padded != 0, 1, 0)

  # Move the inputs to the GPU
  input_ids = torch.tensor(padded).to(device)
  attention_mask = torch.tensor(attention_mask).to(device)

  # Forward pass only - no gradients are needed for feature extraction
  with torch.no_grad():
      last_hidden_states = model(input_ids, attention_mask=attention_mask)
  # Keep the embedding of the [CLS] token (position 0) as the feature vector for each document
  for element in last_hidden_states[0][:,0,:].cpu().numpy():
    results.append(element)
  # Free GPU memory between batches
  gc.collect()
  torch.cuda.empty_cache()
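
One variant I have seen suggested, instead of keeping only the [CLS] vector, is to mean-pool the token embeddings using the attention mask so that padding positions do not contribute. Inside the loop, that would replace the [CLS] extraction with something like this (a sketch I have not yet tested on my data):

  # Mean-pool the token embeddings, masking out the padding positions
  token_embeddings = last_hidden_states[0]                      # (batch, seq_len, hidden)
  mask = attention_mask.unsqueeze(-1).float()                   # (batch, seq_len, 1)
  mean_pooled = (token_embeddings * mask).sum(1) / mask.sum(1)  # (batch, hidden)
  for element in mean_pooled.cpu().numpy():
    results.append(element)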

My training set is a pandas DataFrame with columns [id, document, label], and my code for the logistic regression classifier is very straightforward:

# Stack the per-document [CLS] vectors into a (n_documents, hidden_size) matrix
features = np.array(results)
labels = training_set['label']
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges on the dense BERT features
lr_clf.fit(train_features, train_labels)
lr_clf.score(test_features, test_labels)

