How to Pause / Resume Training in TensorFlow

This question was asked before the documentation for save and restore was available. I now consider it deprecated and advise people to rely on the official documentation on Save and Restore.

Gist of old question:

I got TF working fine for the CIFAR tutorial. I've changed the code to save the train_dir (the directory with checkpoints and models) to a known location.

Which brings me to my question: how can I pause and resume training with TF?



Solution 1:[1]

TensorFlow describes computation as a graph: the nodes are operations (Ops) and the edges carry tensors between them. Variables hold the state that persists between steps, and TensorFlow provides a Saver for writing those Variables to disk.

Because the computation can be distributed, you can run part of a graph on one machine or processor and the rest on another. Meanwhile, you can save the state (the Variables) and load it back next time to continue your work.

    saver.save(sess, 'my-model', global_step=0)     # ==> filename: 'my-model-0'
    ...
    saver.save(sess, 'my-model', global_step=1000)  # ==> filename: 'my-model-1000'

Later, on the same Saver instance, you can call

    saver.restore(sess, save_path)

to restore your saved Variables.
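
As a rough illustration of the save-then-restore cycle above, here is a minimal sketch written against the TF 1.x-style API, reached through tensorflow.compat.v1 so that it should also run on a TF 2 install. The toy variable w, the loss, the train() helper and the 'my-model' prefix are assumptions made for this example and are not part of the original answer.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # use graph + Session semantics, as in TF 1.x


def train(ckpt_dir, num_steps):
    """Run num_steps of a toy job, resuming from ckpt_dir when a checkpoint exists."""
    tf.reset_default_graph()
    global_step = tf.train.get_or_create_global_step()
    # Hypothetical one-variable "model": minimise (w - 3)^2.
    w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
    loss = tf.square(w - 3.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=global_step)
    saver = tf.train.Saver()

    with tf.Session() as sess:
        latest = tf.train.latest_checkpoint(ckpt_dir)  # None if nothing was saved
        if latest is not None:
            saver.restore(sess, latest)  # resume: w and global_step come back
        else:
            sess.run(tf.global_variables_initializer())  # fresh start
        for _ in range(num_steps):
            sess.run(train_op)
        step = int(sess.run(global_step))
        # Writes files prefixed 'my-model-<step>' and updates the 'checkpoint' index.
        saver.save(sess, os.path.join(ckpt_dir, "my-model"), global_step=step)
    return step


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()
    print(train(ckpt_dir, 5))  # first run starts from scratch
    print(train(ckpt_dir, 5))  # second run picks up the saved step count
```

Calling train() twice on the same directory stands in for "pause" and "resume": the second call finds the checkpoint written by the first and continues counting from its global step instead of starting over.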

Saver Usage

Solution 2:[2]

Using tf.train.MonitoredTrainingSession() helped me to resume my training when my machine restarted.

Things to keep in mind:

  1. Make sure you are saving your checkpoints. In tf.train.Saver() you can set max_to_keep to control how many recent checkpoints are kept.
  2. Specify the checkpoint directory in tf.train.MonitoredTrainingSession(checkpoint_dir='dir_path', save_checkpoint_secs=...). The session saves and updates the checkpoints at the interval given by the save_checkpoint_secs argument.
  3. As long as checkpoints keep being saved, the session looks for the latest checkpoint in that directory on startup and resumes training from there.
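
The three points above can be sketched roughly as follows, again through tensorflow.compat.v1. The toy model, the run_training() helper, the StopAtStepHook used to end the loop and the 60-second save interval are assumptions chosen for illustration; the answer itself only names checkpoint_dir-style configuration and save_checkpoint_secs.

```python
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # graph + session semantics, as in TF 1.x


def run_training(checkpoint_dir, last_step):
    """Train until global_step reaches last_step, resuming from checkpoint_dir."""
    tf.reset_default_graph()
    global_step = tf.train.get_or_create_global_step()
    # Hypothetical one-variable "model": minimise (w - 3)^2.
    w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
    loss = tf.square(w - 3.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=global_step)

    # Ask the session to stop once the global step reaches last_step.
    hooks = [tf.train.StopAtStepHook(last_step=last_step)]

    # The session restores the newest checkpoint found in checkpoint_dir (if any)
    # and saves a new one periodically and when the session ends.
    with tf.train.MonitoredTrainingSession(
            checkpoint_dir=checkpoint_dir,
            save_checkpoint_secs=60,  # assumed interval, pick what suits your job
            hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)

    return tf.train.latest_checkpoint(checkpoint_dir)


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()
    print(run_training(ckpt_dir, 5))   # first run
    print(run_training(ckpt_dir, 10))  # simulated restart, continues from step 5
```

No explicit restore call appears here: pointing a new session at the same checkpoint_dir is what makes the restart pick up where the previous run stopped.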

Solution 3:[3]

As described by Hamed, the right way to do it in TensorFlow is:

    import tensorflow as tf

    saver = tf.train.Saver()
    save_path = 'checkpoints/'
    # While training, store the current state with
    saver.save(sess=session, save_path=save_path)
    # and later restore it with
    saver.restore(sess=session, save_path=save_path)

This loads the model from where you last saved it and, if you want, continues training from that point.

Solution 4:[4]

1. Open the checkpoint file and remove the unwanted models from it.
2. Set model_checkpoint_path to the last best model that you want to continue from. The contents of the file look like this:

model_checkpoint_path: "model_gs_043k"
all_model_checkpoint_paths: "model_gs_041k"
all_model_checkpoint_paths: "model_gs_042k"
all_model_checkpoint_paths: "model_gs_043k"

Here, it continues from model_gs_043k

3. Also remove the files of the models you dropped, and the events files if they exist; then you can run your training. Training starts from the last best saved model that exists in the model folder. If no model file exists, training starts from the beginning.
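
The manual edit in steps 1 and 2 can also be scripted. The sketch below is plain Python with no TensorFlow dependency. It assumes the checkpoint file has the simple key: "value" layout shown above, and the helper names read_checkpoint_file and rewrite_checkpoint_file are made up for this example. It does not delete any model or events files, so step 3 still has to be done separately.

```python
import os
import re
import tempfile

# Matches lines such as: model_checkpoint_path: "model_gs_043k"
_LINE = re.compile(r'^(\w+):\s*"(.*)"\s*$')


def read_checkpoint_file(path):
    """Return (current_model, all_models) parsed from a 'checkpoint' index file."""
    current, all_models = None, []
    with open(path) as f:
        for line in f:
            match = _LINE.match(line.strip())
            if not match:
                continue  # skip blank or unrecognised lines
            key, value = match.groups()
            if key == "model_checkpoint_path":
                current = value
            elif key == "all_model_checkpoint_paths":
                all_models.append(value)
    return current, all_models


def rewrite_checkpoint_file(path, resume_from, keep):
    """Rewrite the index so training resumes from resume_from; keep lists survivors."""
    if resume_from not in keep:
        raise ValueError("resume_from must be one of the kept models")
    lines = ['model_checkpoint_path: "%s"' % resume_from]
    lines += ['all_model_checkpoint_paths: "%s"' % name for name in keep]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    index = os.path.join(tempfile.mkdtemp(), "checkpoint")
    models = ["model_gs_041k", "model_gs_042k", "model_gs_043k"]
    rewrite_checkpoint_file(index, "model_gs_043k", models)
    # Roll back: drop model_gs_043k and continue from model_gs_042k instead.
    rewrite_checkpoint_file(index, "model_gs_042k", models[:2])
    print(read_checkpoint_file(index))
```

Reading the file back after the rewrite is a cheap way to confirm which model the next training run would pick up before deleting anything.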

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Community
Solution 3 Saurabh Kumar
Solution 4