Transformers and BERT: downloading to your local machine

I am trying to replicate the code from this page.

At my workplace we have access to the transformers and pytorch libraries, but we cannot connect to the internet from our Python environment. Could anyone help with how to get the script working after manually downloading the files to my machine?

My specific questions are:

  1. Should I go to the "bert-base-uncased at main" page and download all the files? Do I have to put them in a folder with a specific name?

How should I change the code below?

from transformers import BertTokenizer

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)

And how should I change this code?

from transformers import BertModel

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

Please let me know if anyone has done this. Thanks!

### Update 1

I went to the link, manually downloaded all the files to a folder, and specified the path of that folder in my code. The tokenizer works, but the model = BertModel.from_pretrained(...) call with output_hidden_states = True fails. Any idea what I should do? I noticed that the 4 big files have very strange names when downloaded. Should I rename them to the same names as shown on the page above? Do I need to download any other files?

The error message is: OSError: Unable to load weights from pytorch checkpoint file for 'bert-base-uncased2/' at 'bert-base-uncased/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
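Before debugging the load itself, it can help to confirm that the folder actually contains the canonically named files (browser downloads of LFS blobs sometimes arrive under hashed names, as described above). A minimal sketch of such a check; the exact file list is an assumption based on what the bert-base-uncased repo typically contains, and your repo may include extra files:

```python
import os

# Assumed canonical filenames that transformers expects to find in a
# local model folder; the exact list may differ per model repo.
EXPECTED_FILES = [
    "config.json",
    "vocab.txt",
    "tokenizer_config.json",
    "pytorch_model.bin",
]

def missing_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    present = set(os.listdir(model_dir))
    return [f for f in EXPECTED_FILES if f not in present]
```

If `pytorch_model.bin` shows up as missing but a large hash-named file is present, renaming that file to `pytorch_model.bin` is the usual fix.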



Solution 1:[1]

Clone the model repo to download all the files:

git lfs install
git clone https://huggingface.co/bert-base-uncased

# if you want to clone without large files (just their pointers),
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/bert-base-uncased

Git usage:

  1. Download Git from https://git-scm.com/downloads

  2. Paste these into your CLI (terminal):
    a. git lfs install
    b. git clone https://huggingface.co/bert-base-uncased

  3. Wait for the download; it will take time. You can monitor your network activity to check progress.

  4. Find the current directory by simply running cd (on Windows) or pwd (on Linux/macOS) in your CLI and note the file path (e.g. "C:/Users/........./bert-base-uncased")

  5. use it as:

  5. use it as:

     from transformers import BertModel, BertTokenizer
     model = BertModel.from_pretrained("C:/Users/........./bert-base-uncased")
     tokenizer = BertTokenizer.from_pretrained("C:/Users/........./bert-base-uncased")
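Since the machine in question has no internet access, it can also help to tell transformers explicitly never to reach for the network once the files are on disk. A sketch, assuming a recent transformers version that recognizes the `TRANSFORMERS_OFFLINE` environment variable and the `local_files_only` argument:

```python
import os

# Force offline mode: set this BEFORE importing transformers so the
# library never attempts a network call and only reads local files.
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# from transformers import BertModel, BertTokenizer
# model = BertModel.from_pretrained("C:/Users/........./bert-base-uncased")
# # per-call alternative to the env var:
# tokenizer = BertTokenizer.from_pretrained("C:/Users/........./bert-base-uncased",
#                                           local_files_only=True)
```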
    

Manual download, without git:

  1. Download all the files from here https://huggingface.co/bert-base-uncased/tree/main

  2. Put them in a folder named "yourfoldername"

  3. use it as:

     model = BertModel.from_pretrained("C:/Users/........./yourfoldername")
     tokenizer = BertTokenizer.from_pretrained("C:/Users/........./yourfoldername")
    

For only the model (manual download, without git):

  1. Just click the download button here and download only the PyTorch pretrained model (it's about 420 MB): https://huggingface.co/bert-base-uncased/blob/main/pytorch_model.bin

  2. Download the config.json file from https://huggingface.co/bert-base-uncased/tree/main

  3. Put both of them in a folder named "yourfilename"

  4. Use it as:

     model = BertModel.from_pretrained("C:/Users/........./yourfilename")
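A partial or corrupted download of `pytorch_model.bin` is a common cause of the checkpoint-loading error from Update 1. A minimal size sanity check; the ~420 MB figure comes from the model page above, and the lower bound used here is an assumption, not an official number:

```python
import os

# Rough lower bound for a complete bert-base-uncased checkpoint
# (the file is about 420 MB per the model page; 400 MB is an
# assumed safety margin, not an official figure).
EXPECTED_MIN_BYTES = 400 * 1024 * 1024

def looks_complete(weights_path, min_bytes=EXPECTED_MIN_BYTES):
    """Return True if the checkpoint file exists and is plausibly complete."""
    return os.path.isfile(weights_path) and os.path.getsize(weights_path) >= min_bytes
```

If this returns False for your `pytorch_model.bin`, re-download the file before trying anything else.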
    

Solution 2:[2]

Answering "Update 1" for the error: OSError: Unable to load weights from pytorch checkpoint file for 'bert-base-uncased2/' at 'bert-base-uncased/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Please try these methods from https://huggingface.co/transformers/model_doc/bert.html

from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("C:/Users/........./bert-base-uncased")
model = BertForMaskedLM.from_pretrained("C:/Users/........./bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits

If this works, we know there is nothing wrong with the filesystem or folder names.

If it works, then try to get the hidden states. (Note that BertModel already returns the hidden states, as the docs explain: "The bare Bert Model transformer outputting raw hidden-states without any specific head on top." So you don't need output_hidden_states = True.)

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("C:/Users/........./bert-base-uncased")
model = BertModel.from_pretrained("C:/Users/........./bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

If this does not work, try to load the PyTorch checkpoint directly with one of these methods:

# Load all tensors onto the CPU
torch.load("C:/Users/........./bert-base-uncased/pytorch_model.bin", map_location=torch.device('cpu'))
# Load all tensors onto GPU 1
torch.load("C:/Users/........./bert-base-uncased/pytorch_model.bin", map_location=lambda storage, loc: storage.cuda(1))

If the direct torch.load also fails, then there is likely a version compatibility problem between your installed PyTorch (e.g. 1.4.0) and the released BERT checkpoint, or your pytorch_model.bin file was not downloaded completely. Also check that your Python, PyTorch, and transformers versions are compatible with one another.
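To rule out version incompatibility, you can compare the installed version string (e.g. `torch.__version__`) against a minimum. A minimal stdlib-only sketch; it assumes plain dotted version strings like "1.4.0" and does not handle suffixes such as "+cu117" or pre-release tags:

```python
def version_tuple(v):
    """Parse a dotted version string like '1.4.0' into a tuple of ints.
    Assumes plain numeric components; no '+cuXXX' or 'a0' suffixes."""
    return tuple(int(part) for part in v.split(".")[:3])

def at_least(installed, required):
    """True if `installed` is >= `required`, compared numerically."""
    return version_tuple(installed) >= version_tuple(required)

# With torch available you would call, for example:
# at_least(torch.__version__, "1.4.0")
```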

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Ynjxsjmh
Solution 2: