tf BertTokenizer detokenize doesn't match lookup from vocab
I'm following https://www.tensorflow.org/text/guide/subwords_tokenizer, but on a different dataset (WMT15 de-en). I created the vocab file as described in the guide. I load the vocab with:
import tensorflow_text
bert_tokenizer_params = dict(lower_case=True)  # same params the guide uses when building the vocab
en_tokenizer = tensorflow_text.BertTokenizer('en_vocab.txt', **bert_tokenizer_params)
Then:
tokens = en_tokenizer.tokenize(['Hello TensorFlow!'])
merged = tokens.merge_dims(-2,-1)
en_tokenizer.detokenize(merged)
returns
<tf.RaggedTensor [[b'hello', b'tensorflow', b'!']]>
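(For context, tokenize returns a RaggedTensor of shape [batch, words, wordpieces] holding token ids, and merge_dims(-2, -1) flattens the last two axes into one id list per sentence; checking the shapes confirms this:)
tokens.shape  # (1, None, None) -> [batch, words, wordpieces]
merged.shape  # (1, None) -> [batch, wordpieces]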
Which is correct. But a plain lookup doesn't return the same tokens; loading the vocab manually and gathering by id also fails:
import pathlib
import tensorflow as tf

vocab = pathlib.Path('en_vocab.txt').read_text().splitlines()  # one token per line
vocab = tf.Variable(vocab)
tf.gather(vocab, merged)  # expect the token strings at the merged ids
<tf.RaggedTensor [[b'many', b'labor', b'surgery', b'major', b'doctors', b'!']]>
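For reference, this is the kind of lookup I'd expect to work: a sketch using tf.lookup.StaticHashTable keyed by line number, on the assumption that a token's id is its line position in en_vocab.txt.
# Build an id -> token table straight from the vocab file:
# keys are line numbers, values are the whole line (the token string).
init = tf.lookup.TextFileInitializer(
    'en_vocab.txt',
    key_dtype=tf.int64, key_index=tf.lookup.TextFileIndex.LINE_NUMBER,
    value_dtype=tf.string, value_index=tf.lookup.TextFileIndex.WHOLE_LINE)
id_to_token = tf.lookup.StaticHashTable(init, default_value='[UNK]')

# Map over the flat values so the ragged structure of `merged` is kept.
tf.ragged.map_flat_values(id_to_token.lookup, merged)
Since this keys ids by line number just like tf.gather above, I assume it would return the same wrong tokens.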
When I look up the ids by hand in the vocab file I get the same wrong tokens. How does detokenize recover the correct words when I can't via the vocab? And how do I write a lookup function over this vocab that returns the correct tokens? Alternatively, can I get a lookup function from en_tokenizer?
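One idea I've considered (not sure it's right): BertTokenizer's first argument can also be a pre-built lookup table instead of a filename, so maybe I could construct the table myself and keep a handle to it. A sketch, assuming the same en_vocab.txt and bert_tokenizer_params as above:
# token -> id table; BertTokenizer accepts a lookup table in place of a path.
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        'en_vocab.txt',
        key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)
en_tokenizer = tensorflow_text.BertTokenizer(vocab_table, **bert_tokenizer_params)
Is something like this the right way to get a consistent lookup?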