Models in TensorFlow Federated get stuck at 0.1 accuracy
I'm trying to train a federated model on the MNIST dataset, using the code available at https://www.tensorflow.org/federated/tutorials/simulations for the setup.
The dataset version I'm using is the one from Keras (not the federated LEAF version used in the TFF tutorials). I partition it, store the shards in a dictionary, and build my ClientData instance with tff.simulation.datasets.TestClientData.
With only that change, everything works just fine. However, if I swap the model used in the simulation, every round gives me ~0.1 accuracy.
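For concreteness, the construction looks roughly like this (a toy sketch with made-up shapes and two clients, not my real partition; the real version is in the full code below):

import collections
import numpy as np
import tensorflow_federated as tff

# Toy stand-ins for the Keras MNIST arrays (shapes are illustrative only).
X = np.random.rand(10, 28, 28, 1).astype(np.float32)
y = np.eye(10, dtype=np.float32)  # one-hot labels for 10 toy examples

partition = collections.OrderedDict()
for i in range(2):  # two toy clients, 5 examples each
    partition[i] = collections.OrderedDict(
        label=y[i * 5:(i + 1) * 5],
        pixels=X[i * 5:(i + 1) * 5])

client_dataset = tff.simulation.datasets.TestClientData(partition)
print(client_dataset.client_ids)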
The model in the tutorial is as simple as it gets: an input layer of 28*28 = 784 neurons feeding an output layer of size 10 with softmax activation:
model = tf.keras.models.Sequential([
    tf.keras.layers.InputLayer(input_shape=(784,)),
    tf.keras.layers.Dense(units=10, kernel_initializer='zeros'),
    tf.keras.layers.Softmax(),
])
And the new model is a CNN:
model = tf.keras.Sequential(
    [
        tf.keras.layers.Conv2D(
            16,
            8,
            strides=2,
            padding="same",
            activation="relu",
            input_shape=(28, 28, 1),
        ),
        tf.keras.layers.MaxPool2D(2, 1),
        tf.keras.layers.Conv2D(
            32, 4, strides=2, padding="valid", activation="relu"
        ),
        tf.keras.layers.MaxPool2D(2, 1),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10),
    ]
)
In the first case, accuracy improved from round to round, reaching 0.94 quite fast. In the second case I ran about 240 rounds with 3 fixed clients (20k examples each), 10 epochs per round, and batch size 32, and it never got out of ~0.1 accuracy and ~2.3 loss.
The model itself works fine for this dataset: I already tested it in a centralized setup and in a federated setup using the Flower framework, reaching 0.99 accuracy. But for some reason I can't make it work in TFF.
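The centralized check was essentially the following (a reconstructed sketch; the exact script may have differed slightly, and create_cnn_model is the same CNN factory as in the full code below):

import tensorflow as tf
from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

model = create_cnn_model()  # same CNN as in the full code below
model.compile(
    optimizer=tf.keras.optimizers.SGD(0.02),
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.CategoricalAccuracy()])
model.fit(X_train, y_train, batch_size=32, epochs=10,
          validation_data=(X_test, y_test))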
Environment: macOS Big Sur, tensorflow==2.8.0, tensorflow-federated==0.22.0
I expect the metrics and loss to change much more than this. Could there be a problem with using other models?
Full code:
import collections
import time

import numpy as np
import tensorflow as tf
import tensorflow_federated as tff
from tensorflow.keras.datasets import mnist

EPOCHS = 10
BATCH_SIZE = 32
# ROUND_CLIENTS <= NUM_CLIENTS
ROUND_CLIENTS = 3
NUM_CLIENTS = 3
NUM_ROUNDS = 400
def make_client(num_clients, X, y):
    """Splits (X, y) evenly across num_clients and wraps them in TestClientData."""
    total_image_count = len(X)
    image_per_set = int(np.floor(total_image_count / num_clients))
    client_train_dataset = collections.OrderedDict()
    for i in range(1, num_clients + 1):
        client_name = i - 1
        start = image_per_set * (i - 1)
        end = image_per_set * i
        print(f"Adding data from {start} to {end} for client : {client_name}")
        data = collections.OrderedDict(
            (('label', y[start:end]), ('pixels', X[start:end])))
        client_train_dataset[client_name] = data
    train_dataset = tff.simulation.datasets.TestClientData(client_train_dataset)
    return train_dataset
def preprocess(X: np.ndarray, y: np.ndarray):
    """Basic preprocessing for the MNIST dataset."""
    X = np.array(X, dtype=np.float32) / 255
    X = X.reshape((X.shape[0], 28, 28, 1))
    y = np.array(y, dtype=np.int32)
    y = tf.keras.utils.to_categorical(y, num_classes=10)
    return X, y
(X_train, y_train), (X_test, y_test) = mnist.load_data()
(X_train, y_train) = preprocess(X_train, y_train)
(X_test, y_test) = preprocess(X_test, y_test)
mnistFedTrain = make_client(NUM_CLIENTS, X_train, y_train)
def map_fn(example):
    # Rename the features to the (x, y) structure the Keras wrapper expects.
    return collections.OrderedDict(
        x=example['pixels'],
        y=example['label'])

def client_data(client_id):
    # Per-client pipeline: repeat for local epochs, shuffle, batch, rename.
    ds = mnistFedTrain.create_tf_dataset_for_client(
        mnistFedTrain.client_ids[client_id])
    return ds.repeat(EPOCHS).shuffle(500).batch(BATCH_SIZE).map(map_fn)

train_data = [client_data(n) for n in range(ROUND_CLIENTS)]
element_spec = train_data[0].element_spec
def create_cnn_model() -> tf.keras.Model:
    """Returns a sequential keras CNN Model."""
    return tf.keras.Sequential(
        [
            tf.keras.layers.Conv2D(
                16,
                8,
                strides=2,
                padding="same",
                activation="relu",
                input_shape=(28, 28, 1),
            ),
            tf.keras.layers.MaxPool2D(2, 1),
            tf.keras.layers.Conv2D(
                32, 4, strides=2, padding="valid", activation="relu"
            ),
            tf.keras.layers.MaxPool2D(2, 1),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(10),
        ]
    )
def model_fn():
    model = create_cnn_model()
    return tff.learning.from_keras_model(
        model,
        input_spec=element_spec,
        loss=tf.keras.losses.CategoricalCrossentropy(
            from_logits=True, reduction=tf.losses.Reduction.NONE
        ),
        metrics=[tf.keras.metrics.CategoricalAccuracy()]
    )

trainer = tff.learning.build_federated_averaging_process(
    model_fn, client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.02))
def evaluate(num_rounds=NUM_ROUNDS):
    state = trainer.initialize()
    for i in range(num_rounds):
        t1 = time.time()
        state, metrics = trainer.next(state, train_data)
        t2 = time.time()
        print('\n Round {r}: metrics {m}, round time {t:.2f} seconds'.format(
            m=metrics['train'], r=i, t=t2 - t1))

t1 = time.time()
evaluate(NUM_ROUNDS)
t2 = time.time()
print('Seconds:', t2 - t1, ' = Minutes:', (t2 - t1) / 60)
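On top of the training loop above, I also looked at held-out metrics with a federated evaluation along these lines (a sketch only: it assumes evaluate is changed to return its final state, and it reuses make_client and map_fn from above; tff.learning.build_federated_evaluation is, as far as I understand, the 0.22 API for this):

# Sketch: federated evaluation on held-out data.
evaluation = tff.learning.build_federated_evaluation(model_fn)

mnistFedTest = make_client(NUM_CLIENTS, X_test, y_test)
test_data = [
    mnistFedTest.create_tf_dataset_for_client(cid).batch(BATCH_SIZE).map(map_fn)
    for cid in mnistFedTest.client_ids
]

# `state` is the final server state returned by the (modified) evaluate();
# state.model holds the current model weights.
test_metrics = evaluation(state.model, test_data)
print(test_metrics)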
I've had a similar problem with other models as well, e.g. MobileNetV2 implemented in tf.keras for CIFAR-10: `model = tf.keras.applications.MobileNetV2((32, 32, 3), classes=10, weights=None)`.
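For that experiment the wrapping looked roughly like this (a sketch from memory; cifar_element_spec is a placeholder for the element_spec of my CIFAR-10 client datasets). Note that MobileNetV2's default classifier_activation is 'softmax', so the loss uses from_logits=False there:

def cifar_model_fn():
    # Fresh (untrained) MobileNetV2 with a 10-class softmax head.
    model = tf.keras.applications.MobileNetV2(
        (32, 32, 3), classes=10, weights=None)
    return tff.learning.from_keras_model(
        model,
        input_spec=cifar_element_spec,  # placeholder: spec of the CIFAR-10 client data
        loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
        metrics=[tf.keras.metrics.CategoricalAccuracy()])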
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow