Training with almost 1 million spectrograms and transfer learning! Bottleneck in data generator?
Hello, I am trying to fine-tune a pretrained neural network for sound classification. My dataset is quite large: almost 1 million 1-second frames. I am using a data generator to load the spectrograms and feed the network, but I think my method is slow and inefficient, since even though I have 2 GPUs I can't speed up the training. Here's how I load the spectrograms in the generator:
def gen_spectrogram(self, filenames, idx_frame):
    spect_path = self.data_path + 'SpectrogramsPositivePatches/'
    X_data = []
    # One file read per sample in the batch, done serially.
    for f, index in zip(filenames, idx_frame):
        Sxx = load(spect_path + f)
        X_data.append(Sxx[index].reshape(Sxx.shape[1], Sxx.shape[2], 1))
    return X_data
def get_next(self, partition):
    if partition == 'train':
        cur_index = self.train_index
        columns = list(zip(*self.train))
    elif partition == 'test':
        cur_index = self.test_index
        columns = list(zip(*self.test))
    audio_files, idx_frame, label = columns[1], columns[2], columns[3]
    filenames = audio_files[cur_index:cur_index + self.batch_size]
    idx_frames = idx_frame[cur_index:cur_index + self.batch_size]
    labels = label[cur_index:cur_index + self.batch_size]
    X_data = self.gen_spectrogram(filenames, idx_frames)
    return np.array(X_data), np.array(labels)
def next_train(self):
    while True:
        ret = self.get_next('train')
        self.train_index += self.batch_size
        if self.train_index > len(self.train) - self.batch_size:
            self.train_index = 0
            self.shuffle_data_by_partition('train')
        yield ret

def next_test(self):
    while True:
        ret = self.get_next('test')
        self.test_index += self.batch_size
        if self.test_index > len(self.test) - self.batch_size:
            self.test_index = 0
            self.shuffle_data_by_partition('test')
        yield ret
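One thing I suspect is that gen_spectrogram reads the batch's files one at a time, so the GPUs sit idle during disk I/O. For comparison, here is a sketch of loading the per-batch files with a thread pool instead; it assumes the spectrograms are .npy arrays, and load_frame is a hypothetical helper mirroring the body of the loop above:

```python
import os
import tempfile
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def load_frame(path, index):
    # Mirrors one iteration of gen_spectrogram: load the file,
    # pick one frame, and add a channel dimension.
    Sxx = np.load(path)
    return Sxx[index].reshape(Sxx.shape[1], Sxx.shape[2], 1)

def gen_spectrogram_parallel(paths, idx_frames, workers=8):
    # Overlap the per-file disk reads instead of looping serially.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_frame, paths, idx_frames))

# Tiny self-contained demo with fake .npy spectrograms.
tmp = tempfile.mkdtemp()
paths = []
for i in range(4):
    p = os.path.join(tmp, 'spec_%d.npy' % i)
    np.save(p, np.random.rand(3, 8, 8).astype('float32'))  # 3 frames of 8x8
    paths.append(p)

X_data = gen_spectrogram_parallel(paths, [0, 1, 2, 0])
print(len(X_data), X_data[0].shape)  # 4 (8, 8, 1)
```

Threads (rather than processes) are enough here because np.load releases the GIL while reading from disk, so the reads genuinely overlap.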
So I am using a GLOBAL_BATCH_SIZE of 256 (128 per GPU) with a mirrored strategy. Is there a way to speed up the training? It keeps estimating 40 hours per epoch even though only a single Dense layer is trainable, so I think the bottleneck is in the data generator. I have checked this link, https://keras.io/guides/distributed_training/, which says I should use tf.data.Dataset, but I am not getting how to use it. Do you have an idea on how I should remove this bottleneck?
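For context, this is the kind of tf.data pipeline I understand the Keras guide to be describing. It is only a sketch under placeholder assumptions: the in-memory frames/labels arrays, the toy gen generator, and the 64x64 frame shape stand in for my real data, and a generator like next_train would have to be adapted to yield single (frame, label) samples:

```python
import numpy as np
import tensorflow as tf

# Placeholder stand-ins for the spectrogram frames and labels.
frames = np.random.rand(32, 64, 64, 1).astype('float32')
labels = np.random.randint(0, 2, size=32)

def gen():
    # Any Python generator yielding (frame, label) pairs works here.
    for x, y in zip(frames, labels):
        yield x, y

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(64, 64, 1), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)
# batch() builds the global batches; prefetch() lets the CPU prepare
# the next batch while the GPUs train on the current one.
ds = ds.batch(8).prefetch(tf.data.AUTOTUNE)

batch_x, batch_y = next(iter(ds))
print(batch_x.shape)  # (8, 64, 64, 1)
```

As I understand the guide, a dataset built this way can be passed straight to model.fit() under MirroredStrategy, which splits each global batch across the GPUs, so prefetch() is what would hide the loading time.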
Thank you
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
