Same training errors on all nodes when using TensorFlow MultiWorkerMirroredStrategy
I am trying to train a model on three machines using TensorFlow MultiWorkerMirroredStrategy. The script is based on the TensorFlow tutorial Multi-worker training with Keras (https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#dataset_sharding_and_batch_size):
import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

import os
import json

strategy = tf.distribute.MultiWorkerMirroredStrategy()

BUFFER_SIZE = 10000
BATCH_SIZE = 64

def make_datasets_unbatched():
    # scale MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    # data download to /home/pzs/tensorflow_datasets/mnist/
    datasets, info = tfds.load(name='mnist',
                               with_info=True,
                               as_supervised=True)
    return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)

def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=['accuracy'])
    return model

NUM_WORKERS = 3
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS

train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF

train_datasets = make_datasets_unbatched().batch(BATCH_SIZE)
train_datasets = train_datasets.with_options(options)

with strategy.scope():
    multi_worker_model = build_and_compile_cnn_model()

multi_worker_model.fit(x=train_datasets, epochs=30, steps_per_epoch=5)
I run this script separately on the three nodes: on node 1:
TF_CONFIG='{"cluster": {"worker": ["192.168.4.36:12346", "192.168.4.83:12346", "192.168.4.83:12346"]}, "task": {"index": 0, "type": "worker"}}' python3 multi_worker_with_keras.py
on node 2:
TF_CONFIG='{"cluster": {"worker": ["192.168.4.36:12346", "192.168.4.83:12346", "192.168.4.83:12346"]}, "task": {"index": 1, "type": "worker"}}' python3 multi_worker_with_keras.py
on node 3:
TF_CONFIG='{"cluster": {"worker": ["192.168.4.36:12346", "192.168.4.83:12346", "192.168.4.83:12346"]}, "task": {"index": 2, "type": "worker"}}' python3 multi_worker_with_keras.py
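As a side note, here is a minimal, standard-library-only sketch of how a process can inspect the TF_CONFIG it was launched with to confirm which cluster role it was assigned. The helper name describe_tf_config is hypothetical and not part of the original script:
import json
import os

def describe_tf_config():
    # TF_CONFIG is the same environment variable MultiWorkerMirroredStrategy reads at startup.
    tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
    workers = tf_config.get('cluster', {}).get('worker', [])
    task = tf_config.get('task', {})
    print('cluster has %d workers: %s' % (len(workers), workers))
    print('this process is %s #%s' % (task.get('type'), task.get('index')))

describe_tf_config()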
The training loss and accuracy results are:
Epoch 1/30
2022-02-16 11:52:25.060362: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201
5/5 [==============================] - 7s 195ms/step - loss: 2.3010 - accuracy: 0.0719
Epoch 2/30
5/5 [==============================] - 1s 181ms/step - loss: 2.2984 - accuracy: 0.0688
Epoch 3/30
5/5 [==============================] - 1s 182ms/step - loss: 2.2993 - accuracy: 0.0781
Epoch 4/30
5/5 [==============================] - 1s 182ms/step - loss: 2.2917 - accuracy: 0.0594
Epoch 5/30
5/5 [==============================] - 1s 182ms/step - loss: 2.2987 - accuracy: 0.0969
Epoch 6/30
5/5 [==============================] - 1s 183ms/step - loss: 2.2992 - accuracy: 0.0906
Epoch 7/30
5/5 [==============================] - 1s 181ms/step - loss: 2.2978 - accuracy: 0.1000
Epoch 8/30
5/5 [==============================] - 1s 183ms/step - loss: 2.2887 - accuracy: 0.0969
Epoch 9/30
5/5 [==============================] - 1s 182ms/step - loss: 2.2887 - accuracy: 0.0969
Epoch 10/30
5/5 [==============================] - 1s 183ms/step - loss: 2.2930 - accuracy: 0.0844
Epoch 11/30
5/5 [==============================] - 1s 184ms/step - loss: 2.2905 - accuracy: 0.1000
Epoch 12/30
5/5 [==============================] - 1s 184ms/step - loss: 2.2884 - accuracy: 0.0812
Epoch 13/30
5/5 [==============================] - 1s 186ms/step - loss: 2.2837 - accuracy: 0.1250
Epoch 14/30
5/5 [==============================] - 1s 189ms/step - loss: 2.2842 - accuracy: 0.1094
Epoch 15/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2856 - accuracy: 0.0750
Epoch 16/30
5/5 [==============================] - 1s 192ms/step - loss: 2.2911 - accuracy: 0.0719
Epoch 17/30
5/5 [==============================] - 1s 188ms/step - loss: 2.2805 - accuracy: 0.1031
Epoch 18/30
5/5 [==============================] - 1s 187ms/step - loss: 2.2800 - accuracy: 0.1219
Epoch 19/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2799 - accuracy: 0.1063
Epoch 20/30
5/5 [==============================] - 1s 192ms/step - loss: 2.2769 - accuracy: 0.1187
Epoch 21/30
5/5 [==============================] - 1s 193ms/step - loss: 2.2768 - accuracy: 0.1344
Epoch 22/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2754 - accuracy: 0.1187
Epoch 23/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2821 - accuracy: 0.1187
Epoch 24/30
5/5 [==============================] - 1s 188ms/step - loss: 2.2832 - accuracy: 0.0844
Epoch 25/30
5/5 [==============================] - 1s 190ms/step - loss: 2.2793 - accuracy: 0.1125
Epoch 26/30
5/5 [==============================] - 1s 191ms/step - loss: 2.2762 - accuracy: 0.1406
Epoch 27/30
5/5 [==============================] - 1s 194ms/step - loss: 2.2696 - accuracy: 0.1344
Epoch 28/30
5/5 [==============================] - 1s 192ms/step - loss: 2.2717 - accuracy: 0.1406
Epoch 29/30
5/5 [==============================] - 1s 191ms/step - loss: 2.2680 - accuracy: 0.1500
Epoch 30/30
5/5 [==============================] - 1s 193ms/step - loss: 2.2696 - accuracy: 0.1500
All results are exactly the same on the three nodes.
My question is:
When using tf.distribute.MultiWorkerMirroredStrategy to train a model across multiple machines, each process performs forward and backward propagation independently on its own slice of each batch of training data, so why are the reported training errors identical for the corresponding epoch on all three nodes? I tried a different script and observed the same behavior.
Solution 1:
This is expected: the metric values are all-reduced across workers inside the fit method, so every worker reports the same aggregated loss and accuracy.
https://github.com/tensorflow/tensorflow/issues/39343#issuecomment-627008557
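For intuition, here is a minimal custom-training-loop sketch of that aggregation. It is not the asker's code: it uses MirroredStrategy on a single machine as a local stand-in for the multi-worker case, and the model, random data, and batch size are placeholders. Each replica computes a loss on its own slice of the global batch, and the value that gets reported is the reduction over all replicas; fit() does an equivalent aggregation of its metrics, which is why every worker prints identical numbers:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 16

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD(0.01)
    loss_fn = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)

x = tf.random.normal((64, 4))
y = tf.random.normal((64, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

def step_fn(inputs):
    # Runs once per replica, each on its own slice of the global batch.
    features, labels = inputs
    with tf.GradientTape() as tape:
        preds = model(features, training=True)
        per_example_loss = loss_fn(labels, preds)
        # Scale per-example losses by the *global* batch size so the
        # per-replica contributions sum to the global-batch average.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    per_replica_losses = strategy.run(step_fn, args=(inputs,))
    # Reducing over replicas yields one global-batch loss; every replica
    # (and, in the multi-worker case, every worker) sees this same value.
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_dataset:
    print(float(distributed_train_step(batch)))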
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
