How to use Huggingface Data Collator

I was following this tutorial, which comes with this notebook.

I plan to use TensorFlow for my project, so I followed this tutorial and added the line

tokenized_datasets = tokenized_datasets["train"].to_tf_dataset(columns=["input_ids"], shuffle=True, batch_size=16, collate_fn=data_collator)

to the end of the notebook.

However, when I ran it, I got the following error: RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source.

Why didn't this work? How can I use the collator?

Solution 1: [1]

The problem is not your to_tf_dataset call but how the collator is configured: by default it returns PyTorch tensors rather than TensorFlow tensors, which is why the error is raised from PyTorch.

If you look at this, you'll see that their collator is created with the return_tensors="tf" argument. If you pass the same argument when creating your collator, your code for using the collator will work.

In short, create your collator like this:

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15, return_tensors="tf")

This will fix the issue.
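For reference, here is a minimal sketch that puts the two lines from the question and the answer together. The helper name make_tf_dataset and its batch_size parameter are my own for illustration, not part of either library; it assumes tokenized_datasets and tokenizer are the objects built earlier in the notebook, and a transformers version that still supports TensorFlow.

```python
def make_tf_dataset(tokenized_datasets, tokenizer, batch_size=16):
    """Hypothetical helper: wrap the "train" split in a tf.data.Dataset."""
    # Imported lazily so that defining the helper does not load the library.
    from transformers import DataCollatorForLanguageModeling

    # return_tensors="tf" makes the collator emit TensorFlow tensors
    # instead of its default PyTorch tensors.
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm_probability=0.15,
        return_tensors="tf",
    )

    # Same to_tf_dataset call as in the question, now given a TF collator.
    return tokenized_datasets["train"].to_tf_dataset(
        columns=["input_ids"],
        shuffle=True,
        batch_size=batch_size,
        collate_fn=data_collator,
    )
```

You would then call it as train_set = make_tf_dataset(tokenized_datasets, tokenizer) at the end of the notebook.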

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Pro Q
