Is it possible to continue training from a specific epoch?

A resource manager I'm using to fit a Keras model limits access to a server to one day at a time. After that day, I need to start a new job. Is it possible in Keras to save the current model at epoch K, and then load that model in a new job to continue training from epoch K+1?



Solution 1:[1]

You can save weights after every epoch by specifying a callback:

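from keras.callbacks import ModelCheckpoint
# val_loss needs validation data (validation_split=0.1 is only an example value); Keras 2+ takes epochs= instead of nb_epoch=
weight_save_callback = ModelCheckpoint('/path/to/weights.{epoch:02d}-{val_loss:.2f}.hdf5', monitor='val_loss', verbose=0, save_best_only=False, mode='auto')
model.fit(X_train,y_train,batch_size=batch_size,epochs=nb_epoch,validation_split=0.1,callbacks=[weight_save_callback])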

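With the default save_weights_only=False, the callback writes the full model (weights included) after every epoch. You can then load the weights into a freshly built model with: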

model = Sequential()
model.add(...)
model.load_weights('path/to/weights.hdf5')   # path of a checkpoint file written by the callback

Of course, the model architecture needs to be the same in both cases.
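
Because the callback above writes one file per epoch, a new job first has to work out which file is the most recent. The following is a minimal sketch of one way to do that, not part of the original answer: the helper name latest_checkpoint is hypothetical, and it assumes the files follow the 'weights.{epoch:02d}-{val_loss:.2f}.hdf5' pattern used above and all sit in one directory.

```python
import os
import re

# Matches names produced by 'weights.{epoch:02d}-{val_loss:.2f}.hdf5'
# and captures the epoch number (hypothetical helper, for illustration).
_CKPT_RE = re.compile(r"^weights\.(\d+)-.+\.hdf5$")


def latest_checkpoint(directory):
    """Return (path, epoch) of the highest-numbered checkpoint, or None."""
    best = None
    for name in os.listdir(directory):
        match = _CKPT_RE.match(name)
        if match is None:
            continue  # ignore files that are not checkpoints
        epoch = int(match.group(1))
        if best is None or epoch > best[1]:
            best = (os.path.join(directory, name), epoch)
    return best
```

If it returns a result, the path can be passed to model.load_weights(...) and the epoch number tells you how far the previous job got.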

Solution 2:[2]

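You can pass the initial_epoch argument to model.fit. This allows you to continue training from a specific epoch. Note that when initial_epoch is set, the epochs argument is the index of the final epoch, not the number of additional epochs to run.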
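
As a rough illustration of how initial_epoch and epochs interact, here is a small sketch. The wrapper name resume_training and its parameters are hypothetical; it assumes model is an already compiled (or loaded) Keras model whose fit method accepts initial_epoch and epochs.

```python
def resume_training(model, X, y, completed_epochs, total_epochs, **fit_kwargs):
    """Continue training `model` after `completed_epochs` epochs are done.

    Hypothetical wrapper around model.fit, for illustration only.
    """
    if completed_epochs >= total_epochs:
        return None  # nothing left to train
    # `epochs` is the index of the final epoch, so this call runs
    # total_epochs - completed_epochs further epochs.
    return model.fit(X, y,
                     initial_epoch=completed_epochs,
                     epochs=total_epochs,
                     **fit_kwargs)
```

For example, resume_training(model, X_train, y_train, 5, 20) would train for epochs 5 through 19 in Keras' zero-based counting, i.e. 15 more epochs.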

Solution 3:[3]

You can automatically start your training at the next epoch.

What you need is to keep track of your training with a training log file, as follows:

import sys
from keras.callbacks import ModelCheckpoint, CSVLogger

if len(sys.argv)==1:
    model=...                           # you start training normally, no command line arguments
    model.compile(...)
    i_epoch=-1                          # you need this to start at epoch 0
    app=False                           # you want to start logging from scratch
else:
    from keras.models import load_model 
    model=load_model(sys.argv[1])       # you give the saved model as input file
    with open(csvloggerfile) as f:      # you use your training log to get the right epoch number
        rows=[line for line in f if line.strip()]   # ignore blank lines
        i_epoch=int(rows[-1].split(',')[0])         # first column of the last row = last finished epoch
    app=True                            # you want to append to the log file

checkpointer = ModelCheckpoint(savemodel...)
csv_logger = CSVLogger(csvloggerfile, append=app)

model.fit(X, Y, initial_epoch=i_epoch+1, callbacks=[checkpointer,csv_logger])            
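
The log-parsing step can also be done with the csv module, which avoids slicing lines by hand. This is only a sketch under the assumption that the log was written by CSVLogger with its default comma separator and an 'epoch' column; the function name last_logged_epoch is hypothetical.

```python
import csv


def last_logged_epoch(log_path):
    """Return the last epoch index found in a CSVLogger file, or -1 if none.

    Hypothetical helper; assumes a comma-separated log with an 'epoch' column.
    """
    last = -1
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            value = (row.get("epoch") or "").strip()
            if value:  # skip blank or incomplete rows
                last = int(value)
    return last
```

Returning -1 for an empty log matches the i_epoch=-1 convention above, so initial_epoch=last_logged_epoch(csvloggerfile)+1 starts at epoch 0 when nothing has been logged yet.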

That's all folks!

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Henryk Borzymowski
Solution 3