Same training errors on all nodes when using TensorFlow MultiWorkerMirroredStrategy
I am trying to train a model on three machines using TensorFlow MultiWorkerMirroredStrategy. The script is based on the TensorFlow tutorial Multi-worker training with Keras (https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#dataset_sharding_and_batch_size):
import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

import os
import json

strategy = tf.distribute.MultiWorkerMirroredStrategy()

BUFFER_SIZE = 10000
BATCH_SIZE = 64

def make_datasets_unbatched():
    # scale MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    # data download to /home/pzs/tensorflow_datasets/mnist/
    datasets, info = tfds.load(name='mnist',
                               with_info=True,
                               as_supervised=True)
    return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)

def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=['accuracy'])
    return model

NUM_WORKERS = 3
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS

train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF

train_datasets = make_datasets_unbatched().batch(BATCH_SIZE)
train_datasets = train_datasets.with_options(options)

with strategy.scope():
    multi_worker_model = build_and_compile_cnn_model()

multi_worker_model.fit(x=train_datasets, epochs=30, steps_per_epoch=5)
I run this script separately on the three nodes: on node 1:
TF_CONFIG='{"cluster": {"worker": ["192.168.4.36:12346", "192.168.4.83:12346", "192.168.4.83:12346"]}, "task": {"index": 0, "type": "worker"}}' python3 multi_worker_with_keras.py
on node 2:
TF_CONFIG='{"cluster": {"worker": ["192.168.4.36:12346", "192.168.4.83:12346", "192.168.4.83:12346"]}, "task": {"index": 1, "type": "worker"}}' python3 multi_worker_with_keras.py
on node 3:
TF_CONFIG='{"cluster": {"worker": ["192.168.4.36:12346", "192.168.4.83:12346", "192.168.4.83:12346"]}, "task": {"index": 2, "type": "worker"}}' python3 multi_worker_with_keras.py
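As a side note, here is a minimal, standard-library-only sketch of how a process can inspect the TF_CONFIG it was launched with to confirm which cluster role it was assigned. The helper name describe_tf_config is hypothetical and not part of the original script:
import json
import os

def describe_tf_config():
    # TF_CONFIG is the same environment variable MultiWorkerMirroredStrategy reads at startup.
    tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
    workers = tf_config.get('cluster', {}).get('worker', [])
    task = tf_config.get('task', {})
    print('cluster has %d workers: %s' % (len(workers), workers))
    print('this process is %s #%s' % (task.get('type'), task.get('index')))

describe_tf_config()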
The training loss and accuracy results are:
Epoch 1/30
2022-02-16 11:52:25.060362: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201
5/5 [==============================] - 7s 195ms/step - loss: 2.3010 - accuracy: 0.0719
Epoch 2/30
5/5 [==============================] - 1s 181ms/step - loss: 2.2984 - accuracy: 0.0688
Epoch 3/30
5/5 [==============================] - 1s 182ms/step - loss: 2.2993 - accuracy: 0.0781
Epoch 4/30
5/5 [==============================] - 1s 182ms/step - loss: 2.2917 - accuracy: 0.0594
Epoch 5/30
5/5 [==============================] - 1s 182ms/step - loss: 2.2987 - accuracy: 0.0969
Epoch 6/30
5/5 [==============================] - 1s 183ms/step - loss: 2.2992 - accuracy: 0.0906
Epoch 7/30
5/5 [==============================] - 1s 181ms/step - loss: 2.2978 - accuracy: 0.1000
Epoch 8/30
5/5 [==============================] - 1s 183ms/step - loss: 2.2887 - accuracy: 0.0969
Epoch 9/30
5/5 [==============================] - 1s 182ms/step - loss: 2.2887 - accuracy: 0.0969
Epoch 10/30
5/5 [==============================] - 1s 183ms/step - loss: 2.2930 - accuracy: 0.0844
Epoch 11/30
5/5 [==============================] - 1s 184ms/step - loss: 2.2905 - accuracy: 0.1000
Epoch 12/30
5/5 [==============================] - 1s 184ms/step - loss: 2.2884 - accuracy: 0.0812
Epoch 13/30
5/5 [==============================] - 1s 186ms/step - loss: 2.2837 - accuracy: 0.1250
Epoch 14/30
5/5 [==============================] - 1s 189ms/step - loss: 2.2842 - accuracy: 0.1094
Epoch 15/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2856 - accuracy: 0.0750
Epoch 16/30
5/5 [==============================] - 1s 192ms/step - loss: 2.2911 - accuracy: 0.0719
Epoch 17/30
5/5 [==============================] - 1s 188ms/step - loss: 2.2805 - accuracy: 0.1031
Epoch 18/30
5/5 [==============================] - 1s 187ms/step - loss: 2.2800 - accuracy: 0.1219
Epoch 19/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2799 - accuracy: 0.1063
Epoch 20/30
5/5 [==============================] - 1s 192ms/step - loss: 2.2769 - accuracy: 0.1187
Epoch 21/30
5/5 [==============================] - 1s 193ms/step - loss: 2.2768 - accuracy: 0.1344
Epoch 22/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2754 - accuracy: 0.1187
Epoch 23/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2821 - accuracy: 0.1187
Epoch 24/30
5/5 [==============================] - 1s 188ms/step - loss: 2.2832 - accuracy: 0.0844
Epoch 25/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2793 - accuracy: 0.1125
Epoch 26/30
5/5 [==============================] - 1s 191ms/step - loss: 2.2762 - accuracy: 0.1406
Epoch 27/30
5/5 [==============================] - 1s 194ms/step - loss: 2.2696 - accuracy: 0.1344
Epoch 28/30
5/5 [==============================] - 1s 192ms/step - loss: 2.2717 - accuracy: 0.1406
Epoch 29/30
5/5 [==============================] - 1s 191ms/step - loss: 2.2680 - accuracy: 0.1500
Epoch 30/30
5/5 [==============================] - 1s 193ms/step - loss: 2.2696 - accuracy: 0.1500
All results are exactly the same on the three nodes.
My question is:
When using tf.distribute.MultiWorkerMirroredStrategy to train a model across multiple machines, each process performs forward and backward propagation independently on its own slice of each batch of training data, so why are the reported training errors identical for the corresponding epoch on all three nodes? I tried a different script and observed the same behavior.
Solution 1:
This is expected: the metric values are all-reduced across workers inside the fit method, so every worker reports the same aggregated loss and accuracy.
https://github.com/tensorflow/tensorflow/issues/39343#issuecomment-627008557
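For intuition, here is a minimal custom-training-loop sketch of that aggregation. It is not the asker's code: it uses MirroredStrategy on a single machine as a local stand-in for the multi-worker case, and the model, random data, and batch size are placeholders. Each replica computes a loss on its own slice of the global batch, and the value that gets reported is the reduction over all replicas; fit() does an equivalent aggregation of its metrics, which is why every worker prints identical numbers:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 16

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD(0.01)
    loss_fn = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)

x = tf.random.normal((64, 4))
y = tf.random.normal((64, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

def step_fn(inputs):
    # Runs once per replica, each on its own slice of the global batch.
    features, labels = inputs
    with tf.GradientTape() as tape:
        preds = model(features, training=True)
        per_example_loss = loss_fn(labels, preds)
        # Scale per-example losses by the *global* batch size so the
        # per-replica contributions sum to the global-batch average.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    per_replica_losses = strategy.run(step_fn, args=(inputs,))
    # Reducing over replicas yields one global-batch loss; every replica
    # (and, in the multi-worker case, every worker) sees this same value.
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_dataset:
    print(float(distributed_train_step(batch)))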
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
