'how to make DataFlow from CSV in googleDrive to tf DataSet - in Colab
according to the instructions in Colab I could get buffer & even take a pd.DataFrame from it (file is just example)...
# ... authentification
file_id = '1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU' # titanic
# loading data
import io
from googleapiclient.http import MediaIoBaseDownload
drive_service = build('drive', 'v3') # , credentials=creds
request = drive_service.files().get_media(fileId=file_id)
buf = io.BytesIO()
downloader = MediaIoBaseDownload(buf, request)
buf.seek(0)
import pandas as pd
df= pd.read_csv(buf);
print(df.head())
But have trouble with correct creation of dataFlow to Dataset - "buf" var is not working in =>
dataset = tf.data.experimental.make_csv_dataset(csv_file_path, batch_size=100, num_epochs=1)
only "csv_file_path" as 1st arg. Is it possible in Colab to get IO from my GoogleDrive's csv-file into Dataset (used further in training)? And how to do it in a memory-efficient manner?..
P.S. I understand that I perhaps can make file opened for all (in GoogleDrive) & get url to use the simple way:
#TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TRAIN_DATA_URL = "https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
dataset = tf.data.experimental.make_csv_dataset(train_file_path, batch_size=100, num_epochs=1)
! but I DON'T need to share real file... How to save file confidential & get IO from it (in GoogleDrive) to tf.data.Dataset in Colab ? (preferably the shortest code - there will be much more code in real project tested in Colab)
Solution 1:[1]
even simplier
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')
_types = [float(), float(), float(), float(), str()]
_lines = tf.data.TextLineDataset('/content/drive/My Drive/iris.csv')
ds=_lines.skip(1).map(lambda x: tf.io.decode_csv(x, record_defaults=_types) )
ds0= ds.take(2)
print(*ds0.as_numpy_iterator(), sep='\n') # print list with sep => by rows.
OR from df: (and batched for memory economical usage)
import tensorflow as tf
# Load the Drive helper and mount
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/drive')
df= pd.read_csv('/content/drive/My Drive/iris.csv', dtype = 'float32', converters = {'variety' : str}, nrows=20, decimal='.')
ds = tf.data.Dataset.from_tensor_slices(dict(df)) # if mixed types
ds = ds.shuffle(20, reshuffle_each_iteration=False ) # for train.ds ONLY!
ds = ds.batch(batch_size=4)
ds = ds.prefetch(4)
# labels
label= ds.map(lambda x: x['variety'])
print(list(label.as_numpy_iterator()))
# features
#features = ds.map(lambda x: (x['sepal.length'], x['sepal.width']))
# Or with dynamic keys:
features = ds.map(lambda x: (list(map(x.get, list(np.setdiff1d(list(x.keys()),['variety']))))))
print(list(features.as_numpy_iterator()))
with any Transformations in map...
Solution 2:[2]
drive.CreateFile HELPED (link) - as so as I understand that working in Colab - I am working in a separate environment (separate from my PC & I'net env)... So I tried (according link)
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing
link = 'https://drive.google.com/open?id=1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU'
fluff, id = link.split('=')
print (id) # Verify that you have everything after '='
downloaded = drive.CreateFile({'id':id})
downloaded.GetContentFile('Filename.csv')
import tensorflow as tf
ds = tf.data.experimental.make_csv_dataset('Filename.csv', batch_size=100, num_epochs=1)
iterator = ds.as_numpy_iterator()
print(next(iterator))
it works for me. Thanks for the interest to the topic (if somebody tried)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | JeeyCi |
