How do I move a dataset from Hugging Face to Google Cloud?
I am trying to use the Hugging Face multi_nli dataset to train a multi-class text classification model on Google Cloud. Eventually I want to call the model from a Firebase web app. But when I try this code in Colab:
!pip install datasets
from datasets import load_dataset
# Load only train set
dataset = load_dataset(path="multi_nli", split="train")
It says it is saved in /root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72e..., but I can't find the file there, only the dataset variable, so I can't move it to Google Cloud. What is missing for the download to work? Is there some other workaround to get it into Google Cloud?
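Note: a quick way to see where the cached copy actually lives is to inspect the cache_files attribute of the loaded dataset, which lists the Arrow files backing it (the path in the comment below is only an illustration, not the exact one from the question):
from datasets import load_dataset

dataset = load_dataset(path="multi_nli", split="train")

# Each entry points to an Arrow file in the local Hugging Face cache
print(dataset.cache_files)
# e.g. [{'filename': '/root/.cache/huggingface/datasets/multi_nli/default/0.0.0/.../multi_nli-train.arrow'}]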
Solution 1:[1]
It is easy to do with the method Dataset.save_to_disk and the help of the package gcsfs. You will first need to install gcsfs:
pip install gcsfs
Then you can use the methods Dataset.save_to_disk and Dataset.load_from_disk to save and load the dataset from a Google Cloud Storage bucket. To save it:
from datasets import load_dataset
from gcsfs import GCSFileSystem

# Filesystem handle for Google Cloud Storage (uses your Google Cloud credentials)
fs = GCSFileSystem()

# Load only the train split, then write it to the bucket
dataset = load_dataset(path="multi_nli", split="train")
dataset.save_to_disk("gs://YOUR_BUCKET_NAME_HERE/multi_nli/train", fs=fs)
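Note that GCSFileSystem() needs credentials that are allowed to write to the bucket. In Colab, one option (not part of the original answer; the project ID below is a placeholder) is to authenticate the session first and pass your project explicitly:
from google.colab import auth
from gcsfs import GCSFileSystem

# Authorize this Colab session with your Google account
auth.authenticate_user()

# Reuse those credentials for GCS; "your-gcp-project" is a placeholder
fs = GCSFileSystem(project="your-gcp-project")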
Running save_to_disk will create a directory in the Google Cloud Storage bucket YOUR_BUCKET_NAME_HERE with the content of the dataset (the Arrow data files plus the dataset metadata). Then, to load it back, you only need to execute the following:
from datasets import Dataset
from gcsfs import GCSFileSystem

fs = GCSFileSystem()

# Read the saved dataset straight from the bucket
dataset = Dataset.load_from_disk("gs://YOUR_BUCKET_NAME_HERE/multi_nli/train", fs=fs)
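A note on versions: newer releases of the datasets library deprecate the fs= argument of save_to_disk and load_from_disk in favor of storage_options, which takes a dict of fsspec options instead of a filesystem object. If fs= raises a warning or error for you, something along these lines should work (a sketch, assuming a recent datasets version and the same placeholder bucket name):
from datasets import load_dataset, load_from_disk
from gcsfs import GCSFileSystem

fs = GCSFileSystem()

# Pass the fsspec storage options instead of the filesystem object itself
dataset = load_dataset(path="multi_nli", split="train")
dataset.save_to_disk("gs://YOUR_BUCKET_NAME_HERE/multi_nli/train",
                     storage_options=fs.storage_options)

# Loading back works the same way
dataset = load_from_disk("gs://YOUR_BUCKET_NAME_HERE/multi_nli/train",
                         storage_options=fs.storage_options)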
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Gabriel Martín Blazquez |