How to make a data flow from a CSV in Google Drive to a tf.data.Dataset - in Colab

Following the instructions in Colab I can get a buffer and even read a pd.DataFrame from it (the file is just an example):

# ... authentication

file_id = '1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU' # titanic

# loading data
import io
from googleapiclient.http import MediaIoBaseDownload

drive_service = build('drive', 'v3')      # , credentials=creds

request = drive_service.files().get_media(fileId=file_id)
buf = io.BytesIO()
downloader = MediaIoBaseDownload(buf, request)
done = False
while not done:                        # actually download the file into buf
    _, done = downloader.next_chunk()

buf.seek(0)

import pandas as pd
df = pd.read_csv(buf)
print(df.head())

But I have trouble creating the Dataset from this data flow - the "buf" variable does not work in =>

dataset = tf.data.experimental.make_csv_dataset(csv_file_path, batch_size=100, num_epochs=1)

It accepts only a "csv_file_path" as the first argument. Is it possible in Colab to get the IO from my Google Drive CSV file into a Dataset (used later in training)? And how can it be done in a memory-efficient manner?
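One hedged possibility (not from the original post): since tf.data can build a Dataset from in-memory strings, the downloaded buffer's lines can be fed to tf.io.decode_csv directly, with no file path and no sharing. The tiny buffer below is a hypothetical stand-in for the Drive download:

```python
import io
import tensorflow as tf

# Stand-in for the buffer downloaded from Drive (hypothetical contents).
buf = io.BytesIO(b"a,b,label\n1.0,2.0,x\n3.0,4.0,y\n")

# Decode the buffer into text lines, drop the header, and parse each line
# with tf.io.decode_csv - no file path required.
lines = buf.getvalue().decode("utf-8").splitlines()[1:]
ds = tf.data.Dataset.from_tensor_slices(lines).map(
    lambda x: tf.io.decode_csv(x, record_defaults=[float(), float(), str()]))

for row in ds.take(1):
    print([t.numpy() for t in row])   # [1.0, 2.0, b'x']
```

The whole CSV still sits in memory once, but the Dataset itself streams line by line, so downstream batching and prefetching behave as with a file-backed source.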

P.S. I understand that I could perhaps make the file public (in Google Drive) and get a URL to use the simple way:

#TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TRAIN_DATA_URL = "https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
dataset = tf.data.experimental.make_csv_dataset(train_file_path, batch_size=100, num_epochs=1) 

! But I DON'T want to share the real file... How do I keep the file confidential and still get IO from it (in Google Drive) into a tf.data.Dataset in Colab? (Preferably the shortest code - there will be much more code in the real project tested in Colab.)



Solution 1:[1]

Even simpler:

import tensorflow as tf

# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

_types = [float(), float(), float(), float(), str()]
_lines = tf.data.TextLineDataset('/content/drive/My Drive/iris.csv')
ds = _lines.skip(1).map(lambda x: tf.io.decode_csv(x, record_defaults=_types))
ds0 = ds.take(2)
print(*ds0.as_numpy_iterator(), sep='\n')   # print rows separated by newlines

Or from a DataFrame (batched for memory-efficient usage):

import numpy as np
import pandas as pd
import tensorflow as tf

# Load the Drive helper and mount
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/My Drive/iris.csv', dtype='float32',
                 converters={'variety': str}, nrows=20, decimal='.')
ds = tf.data.Dataset.from_tensor_slices(dict(df))    # dict() handles mixed types
ds = ds.shuffle(20, reshuffle_each_iteration=False)  # for the train ds ONLY!
ds = ds.batch(batch_size=4)
ds = ds.prefetch(4)

# labels
label = ds.map(lambda x: x['variety'])

print(list(label.as_numpy_iterator()))

# features
#features = ds.map(lambda x: (x['sepal.length'], x['sepal.width']))
# Or with dynamic keys:
features = ds.map(lambda x: list(map(x.get, np.setdiff1d(list(x.keys()), ['variety']))))

print(list(features.as_numpy_iterator()))

Any transformations can go in the map() calls.
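A hedged follow-up sketch (not in the original answer): the two separate map() calls above can be combined into one that yields (features, label) tuples, the shape model.fit expects. The iris column names below are assumptions carried over from the answer:

```python
import tensorflow as tf

def to_xy(row):
    row = dict(row)                  # copy so pop() does not mutate the element
    label = row.pop('variety')       # assumed label column
    feats = tf.stack(list(row.values()), axis=-1)  # remaining columns -> vector
    return feats, label

# One unbatched element shaped like the iris rows assumed above.
sample = {'sepal.length': tf.constant(5.1), 'sepal.width': tf.constant(3.5),
          'petal.length': tf.constant(1.4), 'petal.width': tf.constant(0.2),
          'variety': tf.constant('Setosa')}
xy = tf.data.Dataset.from_tensors(sample).map(to_xy)
x, y = next(iter(xy))
print(x.shape, y.numpy())   # (4,) b'Setosa'
```

Applied before batch(), this gives batches of (feature_matrix, label_vector) pairs directly.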

Solution 2:[2]

drive.CreateFile HELPED (link) - as I understand it, working in Colab means working in a separate environment (separate from my PC and Internet environment)... So I tried (following the link):

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing
link = 'https://drive.google.com/open?id=1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU'

fluff, file_id = link.split('=')
print(file_id)  # verify that you have everything after '='

downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('Filename.csv')

import tensorflow as tf
ds = tf.data.experimental.make_csv_dataset('Filename.csv', batch_size=100, num_epochs=1) 

iterator = ds.as_numpy_iterator()
print(next(iterator))

It works for me. Thanks for the interest in the topic (if somebody tried it).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 JeeyCi