How to Pause / Resume Training in TensorFlow
This question was asked before the documentation for saving and restoring was available. I would now consider it deprecated and advise people to rely on the official documentation on Save and Restore.
Gist of old question:
I got TF working fine for the CIFAR tutorial. I changed the code to save the
train_dir (the directory with the checkpoint and models) to a known location, which brings me to my question: how can I pause and resume training with TF?
Solution 1:[1]
TensorFlow uses graph-like computation, with nodes (ops) and edges (variables, i.e. state), and it provides a Saver for its variables.
Since the computation can be distributed, you can run part of a graph on one machine/processor and the rest on another; meanwhile, you can save the state (variables) and feed it back next time to continue your work.
saver.save(sess, 'my-model', global_step=0) ==> filename: 'my-model-0'
...
saver.save(sess, 'my-model', global_step=1000) ==> filename: 'my-model-1000'
Later you can call
saver.restore(sess, save_path)
to restore your saved variables.
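A minimal runnable sketch of this save/restore cycle, assuming the TF 1.x API (the variable, filenames, and step counts are illustrative, not from the original answer):

```python
import tensorflow as tf  # TF 1.x API

# Build a trivial graph with one variable to checkpoint.
v = tf.get_variable('v', shape=[], initializer=tf.zeros_initializer())
increment = tf.assign_add(v, 1.0)

saver = tf.train.Saver()

# --- Training phase: save periodically ---
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1001):
        sess.run(increment)
        if step % 1000 == 0:
            saver.save(sess, 'my-model', global_step=step)  # e.g. 'my-model-1000'

# --- Resume phase: restore the latest checkpoint in a fresh session ---
with tf.Session() as sess:
    latest = tf.train.latest_checkpoint('.')  # finds 'my-model-1000'
    saver.restore(sess, latest)
    print(sess.run(v))  # state is back; continue training from here
```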
Solution 2:[2]
Using tf.train.MonitoredTrainingSession() helped me to resume my training when my machine restarted.
Things to keep in mind:
- Make sure you are saving your checkpoints. In tf.train.Saver() you can specify max_to_keep to control how many checkpoints are retained.
- Specify the checkpoint directory via tf.train.MonitoredTrainingSession(checkpoint_dir='dir_path', save_checkpoint_secs=...). Based on the save_checkpoint_secs argument, the session keeps saving and updating the checkpoints.
- As long as you keep saving checkpoints this way, the session looks for the latest checkpoint on startup and resumes training from there (see the sketch after this list).
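A minimal sketch of this setup, assuming a TF 1.x training loop (the directory path, checkpoint interval, and toy loss are placeholders for your real setup):

```python
import tensorflow as tf  # TF 1.x API

# MonitoredTrainingSession needs a global step to checkpoint against.
global_step = tf.train.get_or_create_global_step()
x = tf.get_variable('x', shape=[], initializer=tf.ones_initializer())
loss = tf.square(x)  # toy loss; stands in for your real model
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    loss, global_step=global_step)

# On a fresh start this initializes variables; after a restart it finds
# the latest checkpoint in checkpoint_dir and restores automatically.
with tf.train.MonitoredTrainingSession(
        checkpoint_dir='/tmp/train_dir',      # placeholder path
        save_checkpoint_secs=60) as sess:     # checkpoint every 60 s
    while not sess.should_stop():
        _, step = sess.run([train_op, global_step])
        if step >= 1000:  # stop condition for the toy example
            break
```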
Solution 3:[3]
As described by Hamed, the right way to do it in TensorFlow is:
saver = tf.train.Saver()
save_path = 'checkpoints/my-model'  # note: save_path is a filename prefix, not a bare directory
While training, you can save with
saver.save(sess=session, save_path=save_path)
and later restore with
saver.restore(sess=session, save_path=save_path)
This will load the model from where you last saved it and, if you want, continue training from there.
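To make the pause/resume decision explicit, here is a small sketch (same TF 1.x Saver API; the checkpoint directory, prefix, and toy op are illustrative) that restores the latest checkpoint if one exists and otherwise starts fresh:

```python
import tensorflow as tf  # TF 1.x API

# Stand-in model state; replace with your real variables.
v = tf.get_variable('v', shape=[], initializer=tf.zeros_initializer())
train_step = tf.assign_add(v, 1.0)  # toy "training" op

saver = tf.train.Saver()
checkpoint_dir = 'checkpoints'  # illustrative directory
tf.gfile.MakeDirs(checkpoint_dir)

with tf.Session() as session:
    latest = tf.train.latest_checkpoint(checkpoint_dir)
    if latest is not None:
        saver.restore(session, latest)                   # resume from pause
    else:
        session.run(tf.global_variables_initializer())   # fresh start
    for _ in range(100):
        session.run(train_step)
    saver.save(session, checkpoint_dir + '/my-model')    # pause point
```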
Solution 4:[4]
1. Open the checkpoint file and remove the unwanted models from it.
2. Set model_checkpoint_path to the last good model that you want to continue from.
The content of the file looks like this:
model_checkpoint_path: "model_gs_043k"
all_model_checkpoint_paths: "model_gs_041k"
all_model_checkpoint_paths: "model_gs_042k"
all_model_checkpoint_paths: "model_gs_043k"
Here, training continues from model_gs_043k.
3. Also remove the checkpoint files for the models you dropped, delete the events files if they exist, and then run your training. Training will start from the last saved model that exists in the model folder. If no model file exists, training starts from the beginning.
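Instead of hand-editing the file, the same pointer can be rewritten programmatically. A minimal sketch using tf.train.update_checkpoint_state from the TF 1.x API (the directory and model names mirror the example above and are illustrative):

```python
import tensorflow as tf  # TF 1.x API

# Point the 'checkpoint' file in model_dir at model_gs_043k, keeping
# only the checkpoints we still want listed.
tf.train.update_checkpoint_state(
    save_dir='model_dir',                  # illustrative model folder
    model_checkpoint_path='model_gs_043k',
    all_model_checkpoint_paths=[
        'model_gs_041k', 'model_gs_042k', 'model_gs_043k'])

# Training code that calls tf.train.latest_checkpoint('model_dir') will
# now pick up model_gs_043k (provided its checkpoint files exist).
print(tf.train.latest_checkpoint('model_dir'))
```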
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Community |
| Solution 3 | Saurabh Kumar |
| Solution 4 | |
