TensorFlow seq2seq - keep max three checkpoints not working

I am writing a seq2seq model and would like to keep only three checkpoints. I thought I was implementing this with:

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
manager = tf.train.CheckpointManager(
    checkpoint, directory=checkpoint_dir, max_to_keep=3)

Then:

  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)

Unfortunately, this is not working. Would you have a hint?



Solution 1:[1]

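This can happen in several ways. The limit is set on the checkpoint manager, so it only applies to checkpoints saved through the manager with manager.save(); calling checkpoint.save() directly bypasses it. The same applies when checkpoints are written by a callback.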
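As a minimal sketch of saving through the manager, the snippet below replaces the question's encoder, decoder and optimizer with a single hypothetical `step` variable and writes to a temporary directory; both are assumptions made only to keep the example self-contained. It saves every 2 epochs with `manager.save()` and then lists what the manager has kept.

```python
import tempfile

import tensorflow as tf

# Hypothetical stand-in for the encoder/decoder/optimizer tracked in the question.
step = tf.Variable(0, dtype=tf.int64)
checkpoint = tf.train.Checkpoint(step=step)

# Temporary directory used here instead of './training_checkpoints'.
checkpoint_dir = tempfile.mkdtemp()
manager = tf.train.CheckpointManager(
    checkpoint, directory=checkpoint_dir, max_to_keep=3)

EPOCHS = 10
for epoch in range(EPOCHS):
    step.assign_add(1)  # placeholder for one epoch of training
    # Save every 2 epochs through the manager so max_to_keep is enforced.
    if (epoch + 1) % 2 == 0:
        manager.save()

# Paths of the checkpoints the manager still tracks (oldest first).
kept = list(manager.checkpoints)
print(kept)
```

In the question's training loop, the equivalent change would presumably be to call `manager.save()` where `checkpoint.save(file_prefix = checkpoint_prefix)` is called now.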

Input:

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Training
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
history = model.fit(batched_features, epochs=1000 ,validation_data=(batched_features), callbacks=[cp_callback])

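checkpoint = tf.train.Checkpoint(model)
manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=3)
# restore() expects a checkpoint path, not the directory; latest_checkpoint is None if nothing was saved
checkpoint.restore(manager.latest_checkpoint)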

Output:

Epoch 6/1000
1/1 [==============================] - ETA: 0s - loss: 1.6910e-05 - accuracy: 1.0000
Epoch 6: val_loss improved from 0.00002 to 0.00000, saving model to F:\models\checkpoint\test_checkpoint_restore_01
Epoch 628/1000
1/1 [==============================] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.0000e+00
Epoch 628: val_loss did not improve from 0.00000
1/1 [==============================] - 0s 160ms/step - loss: 0.0000e+00 - accuracy: 0.0000e+00 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 629/1000
1/1 [==============================] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.0000e+00
Epoch 629: val_loss did not improve from 0.00000
1/1 [==============================] - 0s 169ms/step - loss: 0.0000e+00 - accuracy: 0.0000e+00 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 630/1000
1/1 [==============================] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.0000e+00
Epoch 630: val_loss did not improve from 0.00000
1/1 [==============================] - 0s 162ms/step - loss: 0.0000e+00 - accuracy: 0.0000e+00 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 631/1000
1/1 [==============================] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.0000e+00
Epoch 631: val_loss did not improve from 0.00000
1/1 [==============================] - 0s 156ms/step - loss: 0.0000e+00 - accuracy: 0.0000e+00 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
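Restoring follows the same pattern: `restore()` takes the path of a specific checkpoint, which the manager exposes as `latest_checkpoint`. The sketch below is an illustration under assumptions, not the answer's original code; the `weights` variable and the temporary directory are hypothetical placeholders for a real model and checkpoint folder.

```python
import tempfile

import tensorflow as tf

# Hypothetical variable standing in for a model's weights.
weights = tf.Variable([1.0, 2.0])
checkpoint = tf.train.Checkpoint(weights=weights)
manager = tf.train.CheckpointManager(
    checkpoint, tempfile.mkdtemp(), max_to_keep=3)

# Save once, then overwrite the variable to simulate a fresh process.
saved_path = manager.save()
weights.assign([0.0, 0.0])

# Restore from the newest checkpoint path tracked by the manager.
checkpoint.restore(manager.latest_checkpoint)
restored = weights.numpy().tolist()
print(restored)
```

If nothing has been saved yet, `manager.latest_checkpoint` is `None`, so it is worth checking for that before relying on restored values.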

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Martijn Pieters