GPUs not active during validation in Keras
During training I can see that both GPUs are active and processing the data. Then, once the training portion of an epoch completes, GPU activity drops to 0 and CPU usage drops a little too. Could this have something to do with the way I am generating training data?
I have custom data generators feeding the model:
```python
import math

import numpy as np
import tensorflow as tf

# 80/20 train/validation split
train_df, val_df = np.split(dataframe, [int(.8 * len(dataframe.index))])
trainbatches = math.ceil(len(train_df.index) / batchsize)
valbatches = math.ceil(len(val_df.index) / 1024)

# Custom generators wrapped in a queue
train_gen1 = MixedGenerator(train_df, batchsize, scaler)  # tensor Sequence
enqueued_train_gen = queue_generator(train_gen1)
val_gen1 = MixedGenerator(val_df, 1024, scaler)
enqueued_val_gen = queue_generator(val_gen1)

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
print("Number of devices: {}".format(strategy.num_replicas_in_sync))
print('numdata shape', (numdata.shape[1],))

# Model must be built inside the strategy scope
with strategy.scope():
    model = build_mixedinput_model(ir_imgs_shape=(120, 160, 1),
                                   num_data_shape=(numdata.shape[1],),
                                   opt=optimizer,
                                   lossmet=lossmet,
                                   dropout=dropout)

history = model.fit(enqueued_train_gen,
                    steps_per_epoch=trainbatches,
                    validation_data=enqueued_val_gen,
                    validation_steps=valbatches,
                    verbose=1,
                    epochs=epochs,
                    callbacks=clbks)
```
The training generator feeds batches of a predetermined size, but the validation generator feeds batches of size 1024. This is because I am training on small batches but want validation to go faster. My understanding is that this should work, since batch size shouldn't affect validation results.
Is this normal behavior, and is there a better practice I could be using? The validation still takes place, though it takes some time and, as I said, does not make use of the GPUs.
Thanks in advance
UPDATE: this only happens in the first epoch of each run; subsequent epochs run validation extremely quickly. I am still curious why the GPUs and CPU are not engaged during the validation process, though.
Solution 1:[1]
The first epoch takes more time because the code is being compiled into a graph. This includes the model function, the loss function, validation metrics, and others. The compilation process can take some time, but it is only done once. That is why you see faster results after the first epoch.
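As a minimal sketch of that one-time cost (a toy `tf.function`, not the asker's model), the first call to a compiled function pays the tracing/compilation price and later calls reuse the graph:

```python
import time
import tensorflow as tf

@tf.function
def step(x):
    # Stand-in for a train/validation step; the Python body is traced
    # into a graph the first time it is called with this input signature.
    return tf.reduce_sum(tf.square(x))

x = tf.random.normal((1024, 1024))

t0 = time.perf_counter()
step(x)  # first call: traces and compiles the graph (slow)
print(f"first call:  {time.perf_counter() - t0:.4f}s")

t0 = time.perf_counter()
step(x)  # later calls reuse the compiled graph (fast)
print(f"second call: {time.perf_counter() - t0:.4f}s")
```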
The training step and the validation (test) step are compiled separately in the tensorflow.keras source, in Model.make_train_function and Model.make_test_function. Those methods essentially wrap the train and test steps in tf.function.
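In other words, Keras keeps separate, lazily built functions for the two phases, so the first validation pass pays its own tracing cost even though the training step has already been traced. A simplified sketch of what happens inside fit() in recent TF 2.x versions (assuming `model` is an already-compiled tf.keras.Model):

```python
# Simplified: training and evaluation each get their own tf.function,
# which is traced the first time it runs on data. This is why the
# slow validation pass appears only at the end of the first epoch.
train_function = model.make_train_function()  # used by fit()
test_function = model.make_test_function()    # used by evaluate() / validation
```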
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | jakub |
