CsvDataset taking a long time for large data
As my data continues to increase (over 30 GB currently) and I am unable to load it all into memory, I have just started to use tf.data datasets (therefore I'm relatively new at it).
I have two CSV files: one with one-hot encoded labels [0,0,0,0,1,0,0...] for each row (labels.csv), and another (mixture.csv) with the actual data itself. Each row is 1901 float values long and represents 1-D data (like a spectrum), so there are no named feature columns.
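For concreteness (the numbers below are made up, just to show the layout), a single row of mixture.csv looks something like

0.0123, 0.0456, 0.0789, ..., 0.0231   (1901 comma-separated floats)

and the matching row of labels.csv is the one-hot vector

0, 0, 0, 0, 1, 0, ...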
Traditionally I have just read the whole thing in and stored it in memory, like this:
import numpy as np
import pandas as pd

chunksize = 10 ** 6
chunk_list = []
# read mixture.csv in chunks and concatenate into one DataFrame
for chunk in pd.read_csv('mixture.csv', header=None, chunksize=chunksize, dtype=np.float16):
    chunk_list.append(chunk)
All = pd.concat(chunk_list)
and something similar for the labels. I would then do the traditional split into training and test sets,
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(All, labels, test_size=0.25, random_state=12)
Followed by a reshape,
X_train = X_train.reshape(X_train.shape + (1,))
Attempting to change my code so it will work with tf.data (after quite a bit of reading and searching), I came up with this:
import tensorflow as tf

# one tf.float32 record type per column of mixture.csv
mixture_types = []
for i in range(1901):
    mixture_types.append(tf.float32)
dataset = tf.data.experimental.CsvDataset('/content/mixture.csv', mixture_types, header=False)

# one tf.float32 record type per label column
mixture_label_types = []
for i in range(len(pure)):
    mixture_label_types.append(tf.float32)
label_dataset = tf.data.experimental.CsvDataset('labels.csv', mixture_label_types, header=False)
Yes, I know there is probably a better way to build the dtype list, but I was reading from here and that was the best I came up with.
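(I assume, for example, that the same lists could be built more compactly with list multiplication, something like

mixture_types = [tf.float32] * 1901
mixture_label_types = [tf.float32] * len(pure)

but I don't think that part is related to the speed problem.)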
Now it appears that this reads each column in as a separate feature tensor, so to get rid of that I followed what was done in this tutorial,
def pack_row(*row):
    # stack the per-column (batched) scalars into a single feature tensor
    features = tf.stack(row, axis=1)
    return features

packed_ds = dataset.batch(10000).map(pack_row).unbatch()
packed_label = label_dataset.batch(10000).map(pack_row).unbatch()
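If I understand the stacking correctly, each element of packed_ds should now be a single 1-D tensor of shape (1901,). A quick sanity check (purely illustrative):

print(packed_ds.element_spec)
# expecting something like TensorSpec(shape=(1901,), dtype=tf.float32)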
and since the same row in mixture.csv corresponds to its label in the same row of labels.csv,
together_dataset = tf.data.Dataset.zip((packed_ds, packed_label))
together_dataset = together_dataset.shuffle(10000, reshuffle_each_iteration=True)
together_dataset = together_dataset.batch(15)
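For reference, the batches coming out of together_dataset can be inspected like this (purely illustrative; the label width depends on len(pure)):

for x, y in together_dataset.take(1):
    print(x.shape, y.shape)   # expecting roughly (15, 1901) and (15, number_of_label_columns)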
When I go to run it,
model.fit(together_dataset,
          epochs=1,
          callbacks=[best_model],
          verbose=2)
It takes significantly longer than when everything is loaded into memory (for example, roughly 90 minutes compared to 6 minutes in-memory). I knew it would be slower, but I didn't think it would be this bad.
Are there better ways to do this, or am I doing something that is making it worse?