'Anyway to work with Tensorflow in Mac with Apple Silicon (M1, M1 Pro, M1 Max) GPU?

I have a MacBook Pro with an M1 Max processor and I want to run Tensorflow on this GPU. I have followed the steps from https://developer.apple.com/metal/tensorflow-plugin but I don't know why it runs slower on my GPU. I tested with MNIST tutorial from google official page).

Code I tried
import tensorflow as tf
import tensorflow_datasets as tfds

DISABLE_GPU = False

if DISABLE_GPU:
    try:
        # Disable all GPUS
        tf.config.set_visible_devices([], 'GPU')
        visible_devices = tf.config.get_visible_devices()
        for device in visible_devices:
            assert device.device_type != 'GPU'
    except:
        # Invalid device or cannot modify virtual devices once initialized.
        pass

print(tf.__version__)

(ds_train, ds_test), ds_info = tfds.load('mnist', split=['train', 'test'], shuffle_files=True, as_supervised=True,
                                         with_info=True)


def normalize_img(image, label):
    return tf.cast(image, tf.float32) / 255., label


ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)
ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(ds_train, epochs=6, validation_data=ds_test, )
Output (GPU):
462/469 [============================>.] - ETA: 0s - loss: 0.3619 - sparse_categorical_accuracy: 0.9003
469/469 [==============================] - 4s 5ms/step - loss: 0.3595 - sparse_categorical_accuracy: 0.9008 - val_loss: 0.1963 - val_sparse_categorical_accuracy: 0.9432
Epoch 2/6
469/469 [==============================] - 2s 5ms/step - loss: 0.1708 - sparse_categorical_accuracy: 0.9514 - val_loss: 0.1392 - val_sparse_categorical_accuracy: 0.9606
Epoch 3/6
469/469 [==============================] - 2s 5ms/step - loss: 0.1224 - sparse_categorical_accuracy: 0.9651 - val_loss: 0.1233 - val_sparse_categorical_accuracy: 0.9650
Epoch 4/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0956 - sparse_categorical_accuracy: 0.9725 - val_loss: 0.0988 - val_sparse_categorical_accuracy: 0.9696
Epoch 5/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0766 - sparse_categorical_accuracy: 0.9780 - val_loss: 0.0875 - val_sparse_categorical_accuracy: 0.9727
Epoch 6/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0633 - sparse_categorical_accuracy: 0.9813 - val_loss: 0.0842 - val_sparse_categorical_accuracy: 0.9745

Output (without GPU)
469/469 [==============================] - 2s 1ms/step - loss: 0.3598 - sparse_categorical_accuracy: 0.9013 - val_loss: 0.1970 - val_sparse_categorical_accuracy: 0.9427
Epoch 2/6
469/469 [==============================] - 0s 933us/step - loss: 0.1705 - sparse_categorical_accuracy: 0.9511 - val_loss: 0.1449 - val_sparse_categorical_accuracy: 0.9589
Epoch 3/6
469/469 [==============================] - 0s 936us/step - loss: 0.1232 - sparse_categorical_accuracy: 0.9642 - val_loss: 0.1146 - val_sparse_categorical_accuracy: 0.9655
Epoch 4/6
469/469 [==============================] - 0s 925us/step - loss: 0.0955 - sparse_categorical_accuracy: 0.9725 - val_loss: 0.1007 - val_sparse_categorical_accuracy: 0.9690
Epoch 5/6
469/469 [==============================] - 0s 946us/step - loss: 0.0774 - sparse_categorical_accuracy: 0.9781 - val_loss: 0.0890 - val_sparse_categorical_accuracy: 0.9732
Epoch 6/6
469/469 [==============================] - 0s 971us/step - loss: 0.0647 - sparse_categorical_accuracy: 0.9811 - val_loss: 0.0844 - val_sparse_categorical_accuracy: 0.9752


Solution 1:[1]

The reason why it runs slower could be because of the small batch size used in the tutorial. However, make sure you have set up everything correctly as below. We will use miniforge instead of anaconda as it doesn't have GPU support.

Setting up miniforge for TensorFlow support

  1. Download Miniforge3-MacOSX-arm64.sh
  2. Run the file using the following command:-
    • ./Miniforge3-MacOSX-arm64.sh
    • (Don't run above as sudo. If you get permission error, first run chmod +x ./Miniforge3-MacOSX-arm64.sh)
  3. It will download miniforge in the current directory. Now you have to activate it. Use the following command to do so.
    • source miniforge3/bin/activate
  4. You should see (conda) is prepended in your command line. To make sure it is activated during terminal start-up. Use the following command.
    • conda init
    • or if you are using zsh, conda init zsh
  5. Make sure it is activated properly. To check it use which python. It should show .../miniforge3/bin/python. If it doesn't show it, first remove miniforge3 directory and try to install again from step 2. Also, make sure your anaconda environment is disabled.

Now we'll install TensorFlow and its dependencies.

  1. Create a new environment on the top of conda environment using the following command and activate it.
    • conda create -n tensorflow python=<your-python-version
    • (use python --version to find it out)
    • conda activate tensorflow
  2. Now install the TensorFlow dependencies using the following command.
    • conda install -c apple tensorflow-deps.
  3. Install Tensorflow and Tensorflow metal for mac using following command.
    • pip install tensorflow-macos
    • pip install tensorflow-metal

Additional packages

  1. Install jupyter using following command.
    • conda install -c conda-forge jupyterlab

Troubleshooting

  1. 'miniforge3/envs/tensorflow/lib/libcblas.3.dylib' (no such file) or similar libcblas error.
    Solution: conda install -c conda-forge openblas

  2. /tensorflow/core/framework/tensor.h:880] Check failed: IsAligned() ptr = 0x101511d60
    Solution: I found this error when using Tensorflow >2.5.0 for some program. Use TensorFlow version 2.5.0. To reinstall it do following.

    • pip uninstall tensorflow-macos
    • pip uninstall tensorflow-metal
    • conda install -c apple tensorflow-deps==2.5.0 --force-reinstall (optional, try only if you get error)
    • pip install tensorflow-mac==2.5.0
    • pip install tensorflow-metal
  3. You may encounter an import error in jupyter when you import tensorflow. This should be fixed by installing a new kernel.

    • python -m ipykernel install --user --name tensorflow --display-name "Python <your-python-version> (tensorflow)"
    • Important: When you launch jupyter, make sure to select this kernel. Also, jupyter outside tensorflow environment can import tensorflow using this kernel (i.e. You don't have to activate tensorflow environment everytime you want to use it in jupyter).

Testing (M1 Max, 10-core CPU, 24-core GPU version)

Code:
import tensorflow as tf
import tensorflow_datasets as tfds

DISABLE_GPU = False

if DISABLE_GPU:
    try:
        # Disable all GPUS
        tf.config.set_visible_devices([], 'GPU')
        visible_devices = tf.config.get_visible_devices()
        for device in visible_devices:
            assert device.device_type != 'GPU'
    except:
        # Invalid device or cannot modify virtual devices once initialized.
        pass

print(tf.__version__)

(ds_train, ds_test), ds_info = tfds.load('mnist', split=['train', 'test'], shuffle_files=True, as_supervised=True,
                                         with_info=True)


def normalize_img(image, label):
    return tf.cast(image, tf.float32) / 255., label


ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)
ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(ds_train, epochs=6, validation_data=ds_test, )
For batch size = 128
Output (GPU)
462/469 [============================>.] - ETA: 0s - loss: 0.3619 - sparse_categorical_accuracy: 0.9003
469/469 [==============================] - 4s 5ms/step - loss: 0.3595 - sparse_categorical_accuracy: 0.9008 - val_loss: 0.1963 - val_sparse_categorical_accuracy: 0.9432
Epoch 2/6
469/469 [==============================] - 2s 5ms/step - loss: 0.1708 - sparse_categorical_accuracy: 0.9514 - val_loss: 0.1392 - val_sparse_categorical_accuracy: 0.9606
Epoch 3/6
469/469 [==============================] - 2s 5ms/step - loss: 0.1224 - sparse_categorical_accuracy: 0.9651 - val_loss: 0.1233 - val_sparse_categorical_accuracy: 0.9650
Epoch 4/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0956 - sparse_categorical_accuracy: 0.9725 - val_loss: 0.0988 - val_sparse_categorical_accuracy: 0.9696
Epoch 5/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0766 - sparse_categorical_accuracy: 0.9780 - val_loss: 0.0875 - val_sparse_categorical_accuracy: 0.9727
Epoch 6/6
469/469 [==============================] - 2s 5ms/step - loss: 0.0633 - sparse_categorical_accuracy: 0.9813 - val_loss: 0.0842 - val_sparse_categorical_accuracy: 0.9745
Output (without GPU)
469/469 [==============================] - 2s 1ms/step - loss: 0.3598 - sparse_categorical_accuracy: 0.9013 - val_loss: 0.1970 - val_sparse_categorical_accuracy: 0.9427
Epoch 2/6
469/469 [==============================] - 0s 933us/step - loss: 0.1705 - sparse_categorical_accuracy: 0.9511 - val_loss: 0.1449 - val_sparse_categorical_accuracy: 0.9589
Epoch 3/6
469/469 [==============================] - 0s 936us/step - loss: 0.1232 - sparse_categorical_accuracy: 0.9642 - val_loss: 0.1146 - val_sparse_categorical_accuracy: 0.9655
Epoch 4/6
469/469 [==============================] - 0s 925us/step - loss: 0.0955 - sparse_categorical_accuracy: 0.9725 - val_loss: 0.1007 - val_sparse_categorical_accuracy: 0.9690
Epoch 5/6
469/469 [==============================] - 0s 946us/step - loss: 0.0774 - sparse_categorical_accuracy: 0.9781 - val_loss: 0.0890 - val_sparse_categorical_accuracy: 0.9732
Epoch 6/6
469/469 [==============================] - 0s 971us/step - loss: 0.0647 - sparse_categorical_accuracy: 0.9811 - val_loss: 0.0844 - val_sparse_categorical_accuracy: 0.9752

At this point we can see running in CPU is way faster (x5) than running on GPU but don't become disappointed by that. We will run for large batches.

Batch size = 1024
Output (GPU)
58/59 [============================>.] - ETA: 0s - loss: 0.4862 - sparse_categorical_accuracy: 0.8680
59/59 [==============================] - 2s 11ms/step - loss: 0.4839 - sparse_categorical_accuracy: 0.8686 - val_loss: 0.2269 - val_sparse_categorical_accuracy: 0.9362
Epoch 2/6
59/59 [==============================] - 0s 8ms/step - loss: 0.1964 - sparse_categorical_accuracy: 0.9442 - val_loss: 0.1610 - val_sparse_categorical_accuracy: 0.9543
Epoch 3/6
59/59 [==============================] - 1s 9ms/step - loss: 0.1408 - sparse_categorical_accuracy: 0.9605 - val_loss: 0.1292 - val_sparse_categorical_accuracy: 0.9624
Epoch 4/6
59/59 [==============================] - 1s 9ms/step - loss: 0.1067 - sparse_categorical_accuracy: 0.9707 - val_loss: 0.1055 - val_sparse_categorical_accuracy: 0.9687
Epoch 5/6
59/59 [==============================] - 1s 9ms/step - loss: 0.0845 - sparse_categorical_accuracy: 0.9767 - val_loss: 0.0912 - val_sparse_categorical_accuracy: 0.9723
Epoch 6/6
59/59 [==============================] - 1s 9ms/step - loss: 0.0683 - sparse_categorical_accuracy: 0.9814 - val_loss: 0.0827 - val_sparse_categorical_accuracy: 0.9747
Output (without GPU)
59/59 [==============================] - 2s 15ms/step - loss: 0.4640 - sparse_categorical_accuracy: 0.8739 - val_loss: 0.2280 - val_sparse_categorical_accuracy: 0.9338
Epoch 2/6
59/59 [==============================] - 1s 12ms/step - loss: 0.1962 - sparse_categorical_accuracy: 0.9450 - val_loss: 0.1626 - val_sparse_categorical_accuracy: 0.9537
Epoch 3/6
59/59 [==============================] - 1s 12ms/step - loss: 0.1411 - sparse_categorical_accuracy: 0.9602 - val_loss: 0.1304 - val_sparse_categorical_accuracy: 0.9613
Epoch 4/6
59/59 [==============================] - 1s 12ms/step - loss: 0.1091 - sparse_categorical_accuracy: 0.9700 - val_loss: 0.1020 - val_sparse_categorical_accuracy: 0.9698
Epoch 5/6
59/59 [==============================] - 1s 12ms/step - loss: 0.0864 - sparse_categorical_accuracy: 0.9764 - val_loss: 0.0912 - val_sparse_categorical_accuracy: 0.9716
Epoch 6/6
59/59 [==============================] - 1s 12ms/step - loss: 0.0697 - sparse_categorical_accuracy: 0.9812 - val_loss: 0.0834 - val_sparse_categorical_accuracy: 0.9749

As you can see now, running on GPU is faster (x1.3) than running on CPU. Increasing the batch size can significantly improve the performance of GPU.

Fig: Runtime graph for various batch sizes.

Fig: Runtime graph for various batch sizes.

Sources:

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1