How do I move a dataset from Hugging Face to Google Cloud?

I am trying to use the Hugging Face multi_nli dataset to train a multi-class text classification model on Google Cloud, and eventually I want to call the model from a Firebase web app. But when I try this code in Colab:

!pip install datasets
from datasets import load_dataset

# Load only the train split
dataset = load_dataset(path="multi_nli", split="train")

It says it's saved in /root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72e... but I can't find the file, only a variable version, so I can't move it to Google Cloud. What is missing for the download to work? Is there some other workaround to get it to Google Cloud?
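A quick way to see where the downloaded files actually live is to ask the Dataset object for its backing cache files; a minimal sketch, continuing from the code above (the exact cache path will vary by machine):

# The dataset is backed by Arrow files in the local Hugging Face cache
for f in dataset.cache_files:
    print(f["filename"])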



Solution 1:[1]

This is easy to do with the Dataset.save_to_disk method and the gcsfs package. First, install gcsfs:

pip install gcsfs
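Depending on your environment, you may also need Google Cloud credentials set up before gcsfs can write to your bucket. A minimal sketch for Colab follows; the authentication step and the token argument are assumptions about a typical Colab setup, not part of the original answer:

# In Colab, authenticate with Google Cloud before creating the filesystem
from google.colab import auth
auth.authenticate_user()

from gcsfs import GCSFileSystem

# token="google_default" tells gcsfs to use the credentials set up above
fs = GCSFileSystem(token="google_default")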

Then you can use the Dataset.save_to_disk and Dataset.load_from_disk methods to save the dataset to, and load it from, a Google Cloud Storage bucket. To save it:

from datasets import load_dataset
from gcsfs import GCSFileSystem

# File-system interface to Google Cloud Storage
fs = GCSFileSystem()

# Download the train split from the Hugging Face Hub
dataset = load_dataset(path="multi_nli", split="train")

# Write the dataset directly into the GCS bucket
dataset.save_to_disk("gs://YOUR_BUCKET_NAME_HERE/multi_nli/train", fs=fs)

This will create a directory in the Google Cloud Storage bucket YOUR_BUCKET_NAME_HERE with the content of the dataset. Then, to load it back, you only need to execute the following:

from datasets import Dataset
from gcsfs import GCSFileSystem

fs = GCSFileSystem()

# Read the saved dataset straight from the GCS bucket
dataset = Dataset.load_from_disk("gs://YOUR_BUCKET_NAME_HERE/multi_nli/train", fs=fs)
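As a quick sanity check after loading, you can inspect the result like any other datasets.Dataset; a minimal sketch, assuming the standard multi_nli schema with premise, hypothesis, and label columns:

# Number of examples and column names
print(len(dataset))
print(dataset.column_names)

# First example from the train split
print(dataset[0]["premise"])
print(dataset[0]["label"])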


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Gabriel Martín Blazquez