When pre-training a transformer model, how can I add words to the vocabulary?

Given a DistilBERT language model trained for a given language, taken from the Huggingface hub, I want to pre-train the model on a specific domain, and I want to add new words that are:

  • definitely not present in the original training set
  • impossible to handle via WordPiece tokenization: basically, you can think of these words as "codes" that are a normalized form of a named entity

Consider that:

  • I would like to avoid training a new tokenizer: I am fine with adding the new words and then letting the model learn their embeddings during pre-training (see the sketch right after this list)
  • the number of these "words" is far larger than the number of "unused" tokens in the "stock" vocabulary
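For concreteness, here is a rough sketch of the workflow I have in mind, using the transformers API (the checkpoint name and the new_words list are just placeholders, and I have not verified that this is the recommended approach):

```python
from transformers import DistilBertTokenizerFast, DistilBertForMaskedLM

# Placeholder checkpoint and "code" words; substitute your own
model_name = "distilbert-base-uncased"
new_words = ["CODE_0001", "CODE_0002", "CODE_0003"]

tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForMaskedLM.from_pretrained(model_name)

# Register the new words as whole tokens so WordPiece never splits them
num_added = tokenizer.add_tokens(new_words)

# Grow the embedding matrix; the new rows are randomly initialized
model.resize_token_embeddings(len(tokenizer))

# ...then run domain-specific MLM pre-training so the new embeddings are learned
```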

The only advice that I have found is the one reported here:

Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
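For reference, here is a rough PyTorch sketch of what I imagine such a script would do for a Hugging Face DistilBERT checkpoint, rather than the original TensorFlow code (the checkpoint name, the number of new tokens, and the attribute path to the embedding matrix are my own assumptions):

```python
import torch
from transformers import DistilBertModel

# Placeholder checkpoint; substitute the language-specific DistilBERT you start from
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

num_new_tokens = 500  # assumption: the number of "code" words to add

old_emb = model.embeddings.word_embeddings   # nn.Embedding(old_vocab_size, dim)
old_weight = old_emb.weight.data

# New rows, randomly initialized as in the quoted advice (normal, stddev=0.02)
new_rows = torch.empty(num_new_tokens, old_weight.size(1)).normal_(mean=0.0, std=0.02)

# Bigger embedding layer: original vectors kept verbatim, new rows appended at the end
bigger = torch.nn.Embedding(old_weight.size(0) + num_new_tokens, old_weight.size(1))
bigger.weight.data = torch.cat([old_weight, new_rows], dim=0)
model.embeddings.word_embeddings = bigger

# Keep the config consistent with the enlarged vocabulary
model.config.vocab_size = bigger.num_embeddings

# Save a new checkpoint identical to the original except for the bigger vocab
model.save_pretrained("distilbert-extended")
```

If I went this way, I assume I would also need to save a tokenizer with the same words appended, so that token ids and embedding rows stay aligned.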

Do you think this is the only way to achieve my goal?

If so, I have no idea how to write this "script": does anyone have hints on how to proceed (sample code, documentation, etc.)?


