Are there pretrained embeddings for subword tokenizers like sentencepiece or wordpiece?

So I've been trying to find pretrained embeddings for subword tokenizers like sentencepiece or wordpiece, but have been unsuccessful. Do pretrained embeddings for these exist? Is there a library that can take a corpus and produce subword embeddings for any given sentence?

I have a conjecture that a subword tokenizer would work a lot better than traditional tokenizers for my task, but I don't understand how to transform the subword tokens into embeddings. I don't want to use a full BERT architecture due to its massive size, and am hence looking for a lighter alternative.
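For context on the token-to-embedding step the question asks about: once a tokenizer has split text into subwords, an embedding is just a per-token vector lookup (optionally pooled into one vector), and pretrained packages such as BPEmb ship exactly this pairing of a sentencepiece-style vocabulary with learned vectors. Below is a minimal, self-contained sketch of that idea; the toy vocabulary is hand-picked and the vectors are randomly initialized stand-ins for pretrained ones, so every name here is illustrative, not a real library API.

```python
import random

# Toy subword vocabulary: a stand-in for what sentencepiece/wordpiece
# would learn from a corpus (all entries here are illustrative).
VOCAB = {"un", "break", "able", "token", "izer", "embed", "ding"}
UNK = "<unk>"
DIM = 8  # embedding dimension

def tokenize(word, vocab):
    """Greedy longest-match segmentation into subword tokens."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(UNK)  # no subword matches this character
            i += 1
    return tokens

# The "embeddings" are just a lookup table: one vector per subword.
# Randomly initialized here; a pretrained package would ship vectors
# learned from a large corpus instead.
rng = random.Random(0)
EMB = {t: [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
       for t in VOCAB | {UNK}}

def embed(word):
    """Mean-pool the subword vectors into one fixed-size vector."""
    vecs = [EMB[t] for t in tokenize(word, VOCAB)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(tokenize("unbreakable", VOCAB))  # ['un', 'break', 'able']
print(len(embed("unbreakable")))       # 8
```

The same lookup-and-pool pattern is what a learned embedding layer does in any model; swapping the random table for pretrained vectors is the only change needed to avoid training from scratch.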



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
