Load a pre-trained model from disk with Huggingface Transformers

From the documentation for from_pretrained, I understand I don't have to download the pretrained weights every time; I can save them and load from disk using this syntax:

  - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
  - (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
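That suggests a simple round trip: save once, then load from disk. A minimal sketch of what I expected to work (bert-base-cased and the directory name are placeholders):

from transformers import BertTokenizer

# Download once and save to disk.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer.save_pretrained("./my_model_directory/")

# Later: load from disk without hitting the network.
tokenizer = BertTokenizer.from_pretrained("./my_model_directory/", local_files_only=True)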

So, I went to the model hub and found the model I wanted. I downloaded it from the link the model card provided to the original repository. The card describes the model as:

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English.

I stored it in:

  /my/local/models/cased_L-12_H-768_A-12/

Which contains:

 ./
 ../
 bert_config.json
 bert_model.ckpt.data-00000-of-00001
 bert_model.ckpt.index
 bert_model.ckpt.meta
 vocab.txt

So, now I have the following:

  PATH = '/my/local/models/cased_L-12_H-768_A-12/'
  tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

And I get this error:

>           raise EnvironmentError(msg)
E           OSError: Can't load config for '/my/local/models/cased_L-12_H-768_A-12/'. Make sure that:
E           
E           - '/my/local/models/cased_L-12_H-768_A-12/' is a correct model identifier listed on 'https://huggingface.co/models'
E           
E           - or '/my/local/models/cased_L-12_H-768_A-12/' is the correct path to a directory containing a config.json file

Similarly, when I point to the bert_config.json directly:

  PATH = '/my/local/models/cased_L-12_H-768_A-12/bert_config.json'
  tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

        if state_dict is None and not from_tf:
            try:
                state_dict = torch.load(resolved_archive_file, map_location="cpu")
            except Exception:
                raise OSError(
>                   "Unable to load weights from pytorch checkpoint file. "
                    "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
                )
E               OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

What should I do differently to get huggingface to use my local pretrained model?

Update to address the comments

YOURPATH = '/somewhere/on/disk/'

name = 'transfo-xl-wt103'
tokenizer = TransfoXLTokenizerFast(name)
model = TransfoXLModel.from_pretrained(name)
tokenizer.save_pretrained(YOURPATH)
model.save_pretrained(YOURPATH)

Please note you will not be able to load the save vocabulary in Rust-based TransfoXLTokenizerFast as they don't share the same structure.
('/somewhere/on/disk/vocab.bin', '/somewhere/on/disk/special_tokens_map.json', '/somewhere/on/disk/added_tokens.json')

So everything is saved, but then...

YOURPATH = '/somewhere/on/disk/'
TransfoXLTokenizerFast.from_pretrained('transfo-xl-wt103', cache_dir=YOURPATH, local_files_only=True)

    "Cannot find the requested files in the cached path and outgoing traffic has been"
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
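For reference, here is a sketch that loads straight from the directory save_pretrained wrote to, rather than passing the hub name with cache_dir (which makes the library search its own cache layout). Per the warning above, the fast tokenizer may not be able to read the saved vocab.bin, so the slow tokenizer is used:

from transformers import TransfoXLModel, TransfoXLTokenizer

YOURPATH = '/somewhere/on/disk/'
tokenizer = TransfoXLTokenizer.from_pretrained(YOURPATH, local_files_only=True)
model = TransfoXLModel.from_pretrained(YOURPATH, local_files_only=True)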


Solution 1:[1]

Where is your code located relative to the model folder? I believe the path has to be relative rather than absolute. So if the file where you are writing the code lives in 'my/local/', your code should look like this:

PATH = 'models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

You just need to specify the folder containing all the files, not the files directly. I think this is definitely a problem with the PATH. Try changing the style of slashes ("/" vs "\"), since these differ across operating systems. Also try prefixing with ".", like so: ./models/cased_L-12_H-768_A-12/.
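If the slash style is the issue, pathlib sidesteps it by building the path for whatever OS you are on. A sketch (the directory name is the asker's; adjust to your layout):

from pathlib import Path
from transformers import BertTokenizer

# pathlib handles "/" vs "\" per operating system.
PATH = Path("models") / "cased_L-12_H-768_A-12"
tokenizer = BertTokenizer.from_pretrained(str(PATH), local_files_only=True)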

Solution 2:[2]

I had this same need and just got this working with TensorFlow on my Linux box, so I figured I'd share.

My requirements.txt file for my code environment:

tensorflow==2.2.0
Keras==2.4.3
scikit-learn==0.23.1
scipy==1.4.1
numpy==1.18.1
opencv-python==4.5.1.48
seaborn==0.11.1
tensorflow-hub==0.12.0
nltk==3.6.2
tqdm==4.60.0
transformers==4.6.0
ipywidgets==7.6.3

I'm using Python 3.6.

I went to this site, which shows the directory tree for the specific huggingface model I wanted. I happened to want the uncased model, but these steps should be similar for the cased version. Also note that my link was to a very specific commit of this model, just for the sake of reproducibility; there will very likely be a more up-to-date version by the time someone reads this.

I manually downloaded the following files (in some cases the download button took me to a raw view of the txt/json, so I had to copy/paste the contents into Notepad++, oddly enough):

  • config.json
  • tf_model.h5
  • tokenizer_config.json
  • tokenizer.json
  • vocab.txt

NOTE: Once again, all I'm using is TensorFlow, so I didn't download the PyTorch weights. If you're using PyTorch, you'll likely want to download those weights instead of the tf_model.h5 file.

I then put those files in this directory on my Linux box:

/opt/word_embeddings/bert-base-uncased/

It's probably a good idea to check with a quick ls -la that there are at least read permissions on all of these files (my permissions on each file are -rw-r--r--). I also have execute permission on the parent directory (the one listed above) so users can cd into it.

From there, I'm able to load the model like so:

tokenizer:

# python
from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")

layer/model weights:

# python
from transformers import TFAutoModel
# bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
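As a quick sanity check that the local load worked, something like this should run (the shape is what I would expect for bert-base-uncased; treat it as illustrative):

# python
inputs = tokenizer("hello world", return_tensors="tf")
outputs = bert(inputs)
print(outputs.last_hidden_state.shape)  # (1, 4, 768) for bert-base-uncased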

Solution 3:[3]

In addition to the config file and vocab file, you need to add the TF/Torch model (which has a .h5/.bin extension) to your directory.

In your case, the Torch and TF models may be located at these URLs:

torch model: https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin

tf model: https://cdn.huggingface.co/bert-base-cased-tf_model.h5

You can also find all the required files in the "Files and versions" section of your model: https://huggingface.co/bert-base-cased/tree/main
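If clicking through the file list is tedious, the huggingface_hub library can fetch the whole repository in one call. A sketch, assuming a recent huggingface_hub is installed (check your version for the exact signature):

from huggingface_hub import snapshot_download

# Downloads every file in the repo and returns the local directory path.
local_dir = snapshot_download(repo_id="bert-base-cased")
print(local_dir)  # contains config.json, vocab.txt, the weights, etc.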

Solution 4:[4]

The BERT model folder contains these files:

config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt

Instead of these, what if we only have bert_config.json together with:

 bert_model.ckpt.data-00000-of-00001
 bert_model.ckpt.index
 bert_model.ckpt.meta
 vocab.txt

then how do we load the model?
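One hedged possibility: transformers can usually read an original Google TF1 checkpoint if you point from_pretrained at the .ckpt.index file with from_tf=True and supply the config yourself, then re-save in the Hugging Face layout (this needs tensorflow installed; the paths below are the asker's):

from transformers import BertConfig, BertForPreTraining, BertTokenizer

PATH = '/my/local/models/cased_L-12_H-768_A-12/'

# Build the config from the original bert_config.json.
config = BertConfig.from_json_file(PATH + 'bert_config.json')

# Load the TF1 checkpoint weights into a PyTorch model.
model = BertForPreTraining.from_pretrained(
    PATH + 'bert_model.ckpt.index', from_tf=True, config=config
)

# Re-save in the Hugging Face layout (config.json + pytorch_model.bin)
# so a plain from_pretrained(PATH) works from now on.
model.save_pretrained(PATH)
tokenizer = BertTokenizer.from_pretrained(PATH + 'vocab.txt')
tokenizer.save_pretrained(PATH)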

Solution 5:[5]

Here is a short answer.

tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', local_files_only=True)
model = BertForMaskedLM.from_pretrained('/path/to/pytorch_model.bin', config='../config.json', local_files_only=True)

Usually, config.json need not be supplied explicitly if it resides in the same directory as the weights.
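Put differently, if the vocab, config, and weights all sit in one folder, pointing at the folder is enough (a sketch with a placeholder path):

from transformers import BertForMaskedLM, BertTokenizer

# When all files share a directory, pass the directory itself.
tokenizer = BertTokenizer.from_pretrained('/path/to/model_dir/', local_files_only=True)
model = BertForMaskedLM.from_pretrained('/path/to/model_dir/', local_files_only=True)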

Solution 6:[6]

You can use the simpletransformers library; check out the link for a more detailed explanation.

model = ClassificationModel(
    "bert", "dir/your_path"
)

Here I used ClassificationModel as an example. You can use the library for many other tasks as well, such as question answering.
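For completeness, a usage sketch (predict is simpletransformers' standard inference call; the input text is a placeholder):

# Returns class predictions plus the raw model outputs.
predictions, raw_outputs = model.predict(["some example sentence"])
print(predictions)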

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Sameer Zahid
Solution 2: TaylorV
Solution 3: (no attribution given)
Solution 4: pooja
Solution 5: CKM
Solution 6: (no attribution given)