CsvDataset taking a long time for large data

As my data continues to grow (currently over 30 GB) and no longer fits in memory, I have just started using tf.data datasets (so I'm relatively new to them).

I have two CSV files: one with a one-hot encoded label [0,0,0,0,1,0,0...] for each row (labels.csv), and another (mixture.csv) with the actual data itself. Each row contains 1901 float values representing 1-D data (like a spectroscopy trace), so there are no named feature columns.

Traditionally I have just read the whole thing into memory in chunks and stored it:

import numpy as np
import pandas as pd

# read the large file in chunks, then concatenate into one DataFrame
chunksize = 10 ** 6
chunk_list = []
for chunk in pd.read_csv('mixture.csv', header=None, chunksize=chunksize, dtype=np.float16):
  chunk_list.append(chunk)
All = pd.concat(chunk_list)

and something similar for the labels. I would then split into training and test sets the traditional way,

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(All, labels, test_size=0.25, random_state=12)

Followed by a reshape,

X_train = X_train.values.reshape(X_train.shape + (1,))  # .values, since DataFrames have no .reshape
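The reshape just appends a trailing channel axis (presumably so the 1-D data can feed a convolutional model); on toy NumPy shapes:

```python
import numpy as np

# toy stand-in for X_train: 100 rows of 1901 features
X = np.zeros((100, 1901), dtype=np.float16)

# adding (1,) to the shape tuple appends a channel axis
X = X.reshape(X.shape + (1,))
print(X.shape)  # (100, 1901, 1)
```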

Attempting to adapt this so it will work with tf.data (after quite a bit of reading and searching),

import tensorflow as tf

# one record_defaults dtype entry per CSV column
mixture_types = []
for i in range(1901):
  mixture_types.append(tf.float32)

dataset = tf.data.experimental.CsvDataset('/content/mixture.csv', mixture_types, header=False)

mixture_label_types = []
for i in range(len(pure)):
  mixture_label_types.append(tf.float32)

label_dataset = tf.data.experimental.CsvDataset('label.csv', mixture_label_types, header=False)

Yes, I know there is probably a better way to build the dtype list, but I was reading from here and that was the best I came up with.
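For what it's worth, each of the dtype-building loops above can be collapsed to a one-liner (a sketch; 1901 is the feature width from the question):

```python
import tensorflow as tf

# same list the loop builds: one tf.float32 entry per CSV column
mixture_types = [tf.float32] * 1901
```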

Now it appears that this reads each column in as a separate feature, so to get rid of that I followed what was done in this tutorial,

def pack_row(*row):
  # row is a tuple of per-column tensors, each of shape (batch,);
  # stacking along axis=1 gives one (batch, n_columns) feature tensor
  features = tf.stack(row, axis=1)
  return features

packed_ds = dataset.batch(10000).map(pack_row).unbatch()
packed_label = label_dataset.batch(10000).map(pack_row).unbatch()
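To illustrate what pack_row is doing (a toy sketch with 3 columns standing in for 1901): after .batch(10000), CsvDataset yields a tuple of per-column tensors, each of shape (batch,), and tf.stack(..., axis=1) fuses them into a single feature matrix:

```python
import tensorflow as tf

# three toy "columns", each a batch of 4 values
cols = (tf.zeros([4]), tf.ones([4]), tf.fill([4], 2.0))

# stack along axis=1: one (batch, n_columns) feature tensor
features = tf.stack(cols, axis=1)
print(features.shape)  # (4, 3)
```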

and since each row in mixture.csv corresponds to the label in the same row of labels.csv,

together_dataset = tf.data.Dataset.zip((packed_ds, packed_label))
together_dataset = together_dataset.shuffle(10000, reshuffle_each_iteration=True)
together_dataset = together_dataset.batch(15)
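On toy in-memory data, the same zip/shuffle/batch steps look like this (a sketch with 8 features and 3 label classes standing in for the real widths):

```python
import tensorflow as tf

# toy stand-ins for packed_ds and packed_label: 100 paired rows
feats = tf.data.Dataset.from_tensor_slices(tf.zeros([100, 8]))
labels = tf.data.Dataset.from_tensor_slices(tf.zeros([100, 3]))

together = tf.data.Dataset.zip((feats, labels))
together = together.shuffle(100, reshuffle_each_iteration=True)
together = together.batch(15)

x, y = next(iter(together))
print(x.shape, y.shape)  # (15, 8) (15, 3)
```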

When I go to run it,

model.fit(together_dataset,
          epochs=1,
          callbacks=[best_model],
          verbose=2)

It takes significantly longer than when everything is loaded into memory (for example, 90 minutes versus 6 minutes). I knew it would be slower, but I didn't think it would be this bad.

Are there better ways to do this, or am I doing something that is making it worse?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
