Parallelizing a BERT tokenizer across multiple CPUs and/or threads

I am trying to tokenize text using BERT's tokenizer from Hugging Face. I am running the script on a machine with 32 CPUs.

I have a for loop that, for each filename in a list (approx. 2000 files), reads the file and tokenizes its content. I am trying to parallelize the tokenization across the CPUs to make it go faster.

I have tried joblib to parallelize the for loop, defining a delayed function for the tokenization and executing it with joblib's Parallel. I have moved the file I/O out of the function so that it does not affect the parallelization.

import joblib
from joblib import Parallel, delayed
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
model = BertModel.from_pretrained('bert-large-cased')  # used below for the pooled embeddings

def encode_file(idx, file_name, text, tokenizer):
    # Tokenize one file's text and return its pooled BERT embedding.
    inputs = tokenizer.encode(text, add_special_tokens=True, truncation=True,
                              padding="max_length", return_attention_mask=True,
                              return_tensors="pt")
    return (file_name, model(inputs).pooler_output[0].detach().numpy())

number_of_cpu_total = joblib.cpu_count()

#with joblib.parallel_backend(backend="multiprocessing"):
#with joblib.parallel_backend(backend="threading"):
parallel = Parallel(n_jobs=number_of_cpu_total, verbose=10)
bert_encodings_list = parallel(
    delayed(encode_file)(idx, file_name, open(parsed_dir + file_name).read(), tokenizer)
    for idx, file_name in enumerate(data.original_file.unique())
)

This code does not improve the execution time of the whole loop at all. I have tried the different backends (loky (the default), multiprocessing, and threading) and different values of n_jobs; the pattern I used is sketched below.
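For completeness, this is what I mean by trying the different backends (a sketch of the same Parallel call wrapped in each backend in turn):

# Sketch: the same Parallel call run under each backend in turn.
# None of these changed the overall wall time for me.
import joblib
from joblib import Parallel, delayed

for backend in ("loky", "multiprocessing", "threading"):
    with joblib.parallel_backend(backend):
        results = Parallel(n_jobs=number_of_cpu_total, verbose=10)(
            delayed(encode_file)(idx, name, open(parsed_dir + name).read(), tokenizer)
            for idx, name in enumerate(data.original_file.unique())
        )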

How can I make the parallelization work? My feeling is that it has something to do with sharing resources between the different executions, but I cannot see where.
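To make that suspicion concrete: tokenizer and model are created once in the parent process and passed into every task, so with the process-based backends they presumably have to be pickled and shipped to the workers each time. One variant I have been considering (an untested sketch; encode_file_local and the module-level cache are names I made up) would build them lazily inside each worker instead:

# Untested sketch: build the tokenizer/model once per worker process instead
# of shipping the parent's copies into every task. With a fork-based backend
# (e.g. multiprocessing) the module-level cache persists across tasks.
from transformers import BertTokenizer, BertModel

_cache = {}

def encode_file_local(idx, file_name, text):
    if "tokenizer" not in _cache:  # first call in this worker: build once
        _cache["tokenizer"] = BertTokenizer.from_pretrained("bert-large-cased")
        _cache["model"] = BertModel.from_pretrained("bert-large-cased")
    inputs = _cache["tokenizer"].encode(text, add_special_tokens=True,
                                        truncation=True, padding="max_length",
                                        return_tensors="pt")
    return (file_name, _cache["model"](inputs).pooler_output[0].detach().numpy())

I also wonder whether the forward pass itself is already multi-threaded by PyTorch, so that extra worker processes just oversubscribe the 32 cores; calling torch.set_num_threads(1) at the top of the worker function would be one way to test that.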


