How to upload a 62 GB dataset to Google Colab
I am new to processing large datasets and new to Google Colab. I have a 62 GB dataset; I zipped it and uploaded the zip to the Files section of Google Colab.
Only 68 GB of disk space is available, so I cannot both upload the zip file and unzip it: I don't have enough space. Can anyone help me process this dataset on Google Colab or any other platform? I am currently a student and do not have much money to purchase more storage.
Thank you very much.
Solution 1:[1]
You can get datasets into your Colab notebook using these four methods.
1.
Use !wget to download the dataset to the server
Colab is actually a Linux virtual machine (Ubuntu-based, not CentOS) with an optional GPU, so you can use the wget command directly to download the dataset to the server. By default, files are downloaded to the /content path.
Commands to download and unzip the dataset:
!wget https://download.pytorch.org/tutorial/hymenoptera_data.zip
!unzip hymenoptera_data.zip -d ./
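For a file as large as 62 GB it helps to see why wget works at all: the response is streamed to disk in small chunks, so the whole file never has to fit in RAM. Below is a minimal sketch of the same idea in pure Python, not part of the original answer; the function name `stream_to_file` is made up for illustration, and an in-memory stream stands in for the HTTP response (`urllib.request.urlopen(url)` returns a file-like object you could pass as `src` instead).

```python
import io
import os
import shutil
import tempfile

def stream_to_file(src, dst_path, chunk_bytes=1024 * 1024):
    """Copy a readable stream to dst_path in chunk_bytes-sized pieces."""
    with open(dst_path, "wb") as out:
        shutil.copyfileobj(src, out, length=chunk_bytes)

# Demo: an in-memory 5 MB stream stands in for the HTTP response
src = io.BytesIO(b"x" * 5_000_000)
dst = os.path.join(tempfile.mkdtemp(), "data.zip")
stream_to_file(src, dst)
print(os.path.getsize(dst))  # 5000000
```

Peak memory stays at one chunk (1 MiB here) regardless of the file's total size, which is why this works on a VM with far less RAM than 62 GB.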
Commands to load the dataset:
import torchvision.transforms as tfs
from torchvision.datasets import ImageFolder

# Define data preprocessing
train_tf = tfs.Compose([
    tfs.RandomResizedCrop(224),
    tfs.RandomHorizontalFlip(),
    tfs.ToTensor(),
    tfs.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # ImageNet mean and std
])

# Define the dataset using ImageFolder
train_set = ImageFolder('./hymenoptera_data/train/', train_tf)
2. Load datasets from Google Drive
First, mount Google Drive in Colab with the commands below. After running them, you will be asked to authorize your Google account to complete the mount.
from google.colab import drive
drive.mount('/content/drive/')
Upload the file to Google Drive, e.g. data/data.csv. You can either upload it manually or download it into Google Drive with wget, then load it from there.
The advantage of storing data in Google Drive is that, unlike the first method, the data is not lost the next time you connect. The disadvantage is that the free Google Drive tier is only 15 GB, which is not suitable for large datasets. The commands to download a dataset into Google Drive are as follows:
import os

# Change the current working directory to a path on Google Drive
path = "/content/drive/My Drive/Colab Notebooks/"
os.chdir(path)
os.listdir(path)

# Use wget to download the dataset to this path
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/data.csv
Load the dataset:
import pandas as pd
train = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/data.csv')
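If the CSV itself is tens of GB, even reading it from Drive can exhaust Colab's RAM. pandas can read the file in chunks instead, processing and discarding each piece. The sketch below uses a tiny in-memory CSV standing in for the real file; the `value` column name is made up for illustration.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a huge data.csv
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
# chunksize=4 yields DataFrames of at most 4 rows at a time
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()  # process each chunk, then discard it

print(total)  # 45
```

On the real file you would pass the path instead of the StringIO object and pick a chunksize in the tens or hundreds of thousands of rows.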
3. Load a dataset from Kaggle
If you are competing on Kaggle, the dataset you need is already hosted there, and you can download it directly with the kaggle CLI. Create an API token under your Kaggle account settings; this downloads a kaggle.json file containing your username and key:
{"username":"gongenbo","key":"f26dfa65d06321a37f6b8502cd6b8XXX"}
The following uses the distracted-driver detection competition as an example: https://www.kaggle.com/c/state-farm-distracted-driver-detection/data
Commands to download the data via the Kaggle CLI:
!pip install -U -q kaggle
!mkdir -p ~/.kaggle
!echo '{"username":"gongenbo","key":"f26dfa65d06321a37f6b8502cd6b8XXX"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c state-farm-distracted-driver-detection
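The competition download arrives as a zip archive. As an alternative to `!unzip`, here is a sketch of extracting it from Python with the standard library. It is made self-contained by first creating a tiny stand-in archive; the paths and file names are illustrative, not the competition's actual layout.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "competition.zip")

# Create a tiny zip standing in for the downloaded competition archive
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("imgs/train/c0/img_1.jpg", b"fake image bytes")

# Equivalent of `!unzip competition.zip -d data/`
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(os.path.join(workdir, "data"))

print(os.path.exists(os.path.join(workdir, "data", "imgs/train/c0/img_1.jpg")))  # True
```

Doing this in Python (rather than a shell cell) makes it easy to extract only the members you need via `zf.extract(name)`, which matters when disk space is tight.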
Command to submit predictions to Kaggle after training:
!kaggle competitions submit -c state-farm-distracted-driver-detection -f submission.csv -m "Message"
4. Upload to the Colab disk with the upload button
Colab provides roughly 67 GB of disk space. Use the upload button in the Files panel on the left, or call google.colab.files.upload() from a cell. This method is suitable for small or personal datasets.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | J0hn |

