tf BertTokenizer detokenize doesn't match lookup from vocab
I'm following https://www.tensorflow.org/text/guide/subwords_tokenizer, but on a different dataset (WMT15 de-en). I created the vocab file as described in the guide. I load the vocab with:
import tensorflow_text
bert_tokenizer_params = dict(lower_case=True)  # same params the guide uses when building the vocab
en_tokenizer = tensorflow_text.BertTokenizer('en_vocab.txt', **bert_tokenizer_params)
Then:
tokens = en_tokenizer.tokenize(['Hello TensorFlow!'])
merged = tokens.merge_dims(-2,-1)
en_tokenizer.detokenize(merged)
returns
<tf.RaggedTensor [[b'hello', b'tensorflow', b'!']]>
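(For context, tokenize returns a RaggedTensor of shape [batch, words, wordpieces] holding token ids, and merge_dims(-2, -1) flattens the last two axes into one id list per sentence; checking the shapes confirms this:)
tokens.shape  # (1, None, None) -> [batch, words, wordpieces]
merged.shape  # (1, None) -> [batch, wordpieces]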
Which is correct. But a plain lookup doesn't return the same tokens; loading the vocab manually and gathering by id also fails:
import pathlib
import tensorflow as tf

vocab = pathlib.Path('en_vocab.txt').read_text().splitlines()  # one token per line
vocab = tf.Variable(vocab)
tf.gather(vocab, merged)  # expect the token strings at the merged ids
<tf.RaggedTensor [[b'many', b'labor', b'surgery', b'major', b'doctors', b'!']]>
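For reference, this is the kind of lookup I'd expect to work: a sketch using tf.lookup.StaticHashTable keyed by line number, on the assumption that a token's id is its line position in en_vocab.txt.
# Build an id -> token table straight from the vocab file:
# keys are line numbers, values are the whole line (the token string).
init = tf.lookup.TextFileInitializer(
    'en_vocab.txt',
    key_dtype=tf.int64, key_index=tf.lookup.TextFileIndex.LINE_NUMBER,
    value_dtype=tf.string, value_index=tf.lookup.TextFileIndex.WHOLE_LINE)
id_to_token = tf.lookup.StaticHashTable(init, default_value='[UNK]')

# Map over the flat values so the ragged structure of `merged` is kept.
tf.ragged.map_flat_values(id_to_token.lookup, merged)
Since this keys ids by line number just like tf.gather above, I assume it would return the same wrong tokens.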
When I look up the ids by hand in the vocab file I get the same wrong tokens. How does detokenize recover the correct words when I can't via the vocab? And how do I write a lookup function over this vocab that returns the correct tokens? Alternatively, can I get a lookup function from en_tokenizer?
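One idea I've considered (not sure it's right): BertTokenizer's first argument can also be a pre-built lookup table instead of a filename, so maybe I could construct the table myself and keep a handle to it. A sketch, assuming the same en_vocab.txt and bert_tokenizer_params as above:
# token -> id table; BertTokenizer accepts a lookup table in place of a path.
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        'en_vocab.txt',
        key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)
en_tokenizer = tensorflow_text.BertTokenizer(vocab_table, **bert_tokenizer_params)
Is something like this the right way to get a consistent lookup?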