Loading files from .npz with Keras custom data generator
I'm training a CNN using a Keras data generator (tf.keras.utils.Sequence) on NumPy arrays saved locally. For my baseline on a small dataset, loading the arrays like so works fine:
import numpy as np
import tensorflow as tf

X_data = np.load("X_data.npz")
y_data = np.load("y_data.npz")
X = X_data["arr_0"]
y = y_data["arr_0"]
class Gen_Data(tf.keras.utils.Sequence):
    def __init__(self, batch_size, t_size, X_data, y_data):
        self.batch_size = batch_size
        self.t_size = t_size
        self.X_data = X_data
        self.y_data = y_data
        self.n_batches = len(self.X_data) // self.batch_size
        self.batch_idx = np.array_split(range(len(self.X_data)), self.n_batches)

    def __len__(self):
        return self.n_batches

    def __getitem__(self, idx):
        batch_X = self.X_data[self.batch_idx[idx]]
        batch_y = self.y_data[self.batch_idx[idx]]
        return batch_X, batch_y
The problem is that this requires loading the whole dataset into memory, which could become an issue when I move to my actual dataset, which is much larger. When the arrays are saved as .npz files with np.savez_compressed, I seem to have to load the whole archive to access any of the data.
I would therefore like to know whether there is a way to load batches of arrays from X.npz without loading it in its entirety, so that I can build this into the __getitem__ method of the generator.
I would also be interested in hearing any alternative strategies for batch-loading large arrays into Keras/TensorFlow. Thanks for any advice!
Solution 1:[1]
You could try converting your .npz files to HDF5 format. h5py supports compression and on-disk access (similar to a memmap) and provides NumPy-like access to the data.
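As a rough sketch of what this could look like (my own illustration, not part of the original answer; the file name data.h5, the dataset keys and the H5Gen class are assumptions), you could convert the archives once and then read only one slice per batch:

```python
import numpy as np
import h5py
import tensorflow as tf

# One-off conversion: copy the arrays from the .npz archives into compressed HDF5 datasets.
with np.load("X_data.npz") as xf, np.load("y_data.npz") as yf, \
        h5py.File("data.h5", "w") as h5:
    h5.create_dataset("X", data=xf["arr_0"], compression="gzip")
    h5.create_dataset("y", data=yf["arr_0"], compression="gzip")

class H5Gen(tf.keras.utils.Sequence):
    def __init__(self, path, batch_size):
        # keep the file open; the data itself stays on disk
        # (with Keras multiprocessing workers you would open it lazily per worker instead)
        self.h5 = h5py.File(path, "r")
        self.X, self.y = self.h5["X"], self.h5["y"]
        self.batch_size = batch_size

    def __len__(self):
        return len(self.X) // self.batch_size

    def __getitem__(self, idx):
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # h5py reads only the requested slice from disk
        return self.X[s], self.y[s]
```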
Solution 2:[2]
EDIT: After reading more of the docs, I realised that load() called on a .npz file returns a lazy loader (an NpzFile object), so it does not actually read all the data into RAM at once. It only stores the member names and reads an array into main memory when it is requested (e.g. data["arr_0"]), meaning you shouldn't worry about the whole archive being loaded when you call load(). (Note that each individual array is still read in full once its key is accessed.)
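For example (a minimal illustration, assuming the file names from the question):

```python
import numpy as np

with np.load("X_data.npz") as npz:
    print(npz.files)   # only the member names are known at this point, e.g. ['arr_0']
    X = npz["arr_0"]   # the array is decompressed and read into memory here
```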
You could instead store the filenames in the generator, and load them when needed during training, using something like this:
import cv2
import numpy as np
import tensorflow as tf

class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, db_dir, batch_size, input_shape, num_classes,
                 shuffle=True):
        self.input_shape = input_shape
        self.batch_size = batch_size
        self.num_classes = num_classes
        self.shuffle = shuffle
        self.db_dir = db_dir
        # load the paths and labels from the root directory
        self.data, self.labels = self.get_data(db_dir)
        self.indices = np.arange(len(self.data))
        self.on_epoch_end()

    def get_data(self, root_dir):
        """
        Loads the paths to the images and their corresponding labels from the database directory
        """
        self.data = []
        self.labels = []
        with open('labels.txt', 'r') as f:
            for line in f.readlines():
                splitted = line.split()
                self.data.append("./{0}/".format(root_dir) + splitted[0] + ".jpg")
                self.labels.append(int(splitted[1]))
        return self.data, self.labels

    def __len__(self):
        """
        Returns the number of batches per epoch: the total size of the dataset divided by the batch size
        """
        return int(np.floor(len(self.data) / self.batch_size))

    def __getitem__(self, index):
        """
        Generates a batch of data
        """
        batch_indices = self.indices[index*self.batch_size : (index+1)*self.batch_size]
        # load only the images belonging to this batch, converting BGR (OpenCV) to RGB
        batch_x = [cv2.cvtColor(cv2.imread(self.data[i]), cv2.COLOR_BGR2RGB) for i in batch_indices]
        # load the corresponding labels (shifted to start at 0)
        batch_y = [self.labels[i] - 1 for i in batch_indices]
        # optionally: batch_y = tf.keras.utils.to_categorical(batch_y, num_classes=self.num_classes)
        return np.array(batch_x), np.array(batch_y)

    def on_epoch_end(self):
        """
        Called at the end of each epoch
        """
        # if required, reshuffle the indices after each epoch
        self.indices = np.arange(len(self.data))
        if self.shuffle:
            np.random.shuffle(self.indices)
In this example I was using JPEGs and loading them with OpenCV, but the approach is similar for any file format. Data is only loaded when it is needed; otherwise the generator only stores the paths and indices. The indices are only necessary if you want to reshuffle between epochs.
This does, of course, require splitting the data into separate files instead of keeping everything in a single file; a single file that you load directly will always end up in RAM in its entirety.
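To apply that idea to the NumPy arrays from the question, a rough sketch (my own addition, not from the answer; the samples/ directory, the file naming scheme and the NpyFileGen class are assumptions) would be to write each sample to its own .npy file once, and have the generator load only the files of the current batch:

```python
import os
import numpy as np
import tensorflow as tf

# One-off split: this still materialises X once, so run it somewhere with enough memory.
os.makedirs("samples", exist_ok=True)
with np.load("X_data.npz") as xf, np.load("y_data.npz") as yf:
    X = xf["arr_0"]
    labels = yf["arr_0"]          # the labels are small enough to keep in memory
    for i in range(len(X)):
        np.save(f"samples/x_{i:06d}.npy", X[i])

class NpyFileGen(tf.keras.utils.Sequence):
    def __init__(self, n_samples, labels, batch_size):
        self.paths = [f"samples/x_{i:06d}.npy" for i in range(n_samples)]
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        return len(self.paths) // self.batch_size

    def __getitem__(self, idx):
        lo, hi = idx * self.batch_size, (idx + 1) * self.batch_size
        # only the .npy files belonging to this batch are read from disk
        batch_x = np.stack([np.load(p) for p in self.paths[lo:hi]])
        batch_y = self.labels[lo:hi]
        return batch_x, batch_y
```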
Solution 3:[3]
Depending on the size of the dataset, you could break it down into smaller files on a remote server with more memory available, or use numpy.memmap: https://numpy.org/doc/stable/reference/generated/numpy.memmap.html
memmap does not load the file into RAM; it only references the file in storage.
- You must specify the shape, otherwise the array will be read as flattened (1-D).
- Computations on a memmap are very slow because the data is not in memory, so it is best to pull batches into memory and work on those.
Example:
# note: np.memmap maps raw bytes, so for an array saved with np.save it is
# safer to use np.load("ReallyBigArray.npy", mmap_mode='r'), which handles the .npy header
fp = np.memmap("ReallyBigArray.npy", dtype='float32', mode='r', shape=(cols, rows))
# load a slice of the array into memory for processing
fpInMemory = fp[:1000]
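To tie this back to the question, here is a minimal sketch (my own addition, not part of the answer; the MemmapGen class and its arguments are placeholders) of a Sequence that slices a memory-mapped array per batch. Note that memory-mapping only works for uncompressed .npy files saved with np.save; a compressed .npz archive cannot be memory-mapped:

```python
import numpy as np
import tensorflow as tf

class MemmapGen(tf.keras.utils.Sequence):
    def __init__(self, x_path, y, batch_size):
        # mmap_mode='r' maps the .npy file on disk instead of reading it into RAM
        self.X = np.load(x_path, mmap_mode='r')
        self.y = y
        self.batch_size = batch_size

    def __len__(self):
        return len(self.X) // self.batch_size

    def __getitem__(self, idx):
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # np.asarray copies just this batch from the memmap into memory
        return np.asarray(self.X[s]), self.y[s]
```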
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Raphael |
| Solution 2 | |
| Solution 3 | Tyler Robbins |
