GRU (gated recurrent unit) not working on GPU (TensorFlow)
I am trying to train RNN models on my GPU (an NVIDIA RTX 3080) using TensorFlow, but GRU cells are not working properly.
Training LSTM models works fine and takes only a few seconds.
Example
act = "tanh"
recurrent_act = "sigmoid"
lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, activation=act, recurrent_activation=recurrent_act),
    tf.keras.layers.LSTM(32, return_sequences=True, activation=act, recurrent_activation=recurrent_act),
    tf.keras.layers.Dense(units=8),
    tf.keras.layers.Dense(units=1)
])
history = compile_and_fit(lstm_model, wide_window_d)
# Epoch 1/120
# 368/368 [==============================] - 12s 17ms/step - loss: 0.2664 - mean_absolute_error: 0.2883 - val_loss: 0.0273 - val_mean_absolute_error: 0.0845
# Epoch 2/120
# 368/368 [==============================] - 5s 14ms/step - loss: 0.0067 - mean_absolute_error: 0.0381 - val_loss: 0.0063 - val_mean_absolute_error: 0.0435
However, when I use GRU cells, training takes about 10x longer.
gru_model = tf.keras.models.Sequential([
    tf.keras.layers.GRU(32, return_sequences=True, activation=act, recurrent_activation=recurrent_act),
    tf.keras.layers.GRU(32, return_sequences=True, activation=act, recurrent_activation=recurrent_act),
    tf.keras.layers.Dense(units=1)
])
history = compile_and_fit(gru_model, wide_window_d)
# Epoch 1/120
# 368/368 [==============================] - 49s 129ms/step - loss: 0.1086 - mean_absolute_error: 0.1560 - val_loss: 0.0101 - val_mean_absolute_error: 0.0498
# Epoch 2/120
# 368/368 [==============================] - 48s 130ms/step - loss: 0.0018 - mean_absolute_error: 0.0210 - val_loss: 0.0038 - val_mean_absolute_error: 0.0320
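For context: Keras only dispatches GRU/LSTM layers to the fast fused cuDNN kernel when the layer configuration matches a fixed set of defaults; otherwise it silently falls back to a much slower generic implementation. The helper below is purely illustrative (it is not part of the Keras API) and just encodes the documented conditions for GRU:

```python
# Conditions under which tf.keras.layers.GRU can use the fused cuDNN
# kernel, per the TensorFlow documentation. Illustrative helper only --
# this function is NOT part of the Keras API.
def gru_is_cudnn_eligible(activation="tanh",
                          recurrent_activation="sigmoid",
                          recurrent_dropout=0.0,
                          unroll=False,
                          use_bias=True,
                          reset_after=True):
    return (activation == "tanh"
            and recurrent_activation == "sigmoid"
            and recurrent_dropout == 0.0
            and unroll is False
            and use_bias is True
            and reset_after is True)

# The configuration used above (tanh + sigmoid, all other args default)
# satisfies every condition:
print(gru_is_cudnn_eligible(activation="tanh",
                            recurrent_activation="sigmoid"))  # -> True
```

Since the GRU layers above already meet all of these conditions, the slowdown here is not a configuration-driven fallback, which points instead at the TensorFlow/cuDNN combination itself.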
But the biggest problem is when I use a Bidirectional GRU: training time keeps increasing until it hangs completely and I have to restart the kernel.
gru_model_bidirectional = tf.keras.models.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32, return_sequences=True)),
    tf.keras.layers.Dense(units=1)
])
history = compile_and_fit(gru_model_bidirectional, wide_window_d)
# Epoch 1/120
# 33/368 [=>............................] - ETA: 1:19 - loss: 0.5314 - mean_absolute_error: 0.4526
!! It always gets stuck at this point, a few seconds after starting, and I have to restart the kernel. !!
My current specs and versions
I am using Anaconda.
- Tensorflow: 2.4.1
- cudatoolkit: 11.2.1
- cudnn: 8.1.0.77
- Python: 3.8
What I have tried so far
I have tried installing various versions of TensorFlow (even tf-nightly) as well as other versions of CUDA and cuDNN, but I get stuck on the Bidirectional GRU every time.
I have also read that there might be a problem with GPU memory, which is why I added this to my code (right after importing TensorFlow):
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)
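In TF 2.x the same memory-growth setting can also be applied without the `compat.v1` session; a sketch using the current `tf.config` API:

```python
import tensorflow as tf

# Enable memory growth on every visible GPU so TensorFlow allocates
# VRAM incrementally instead of reserving almost all of it up front.
# Must be called before any GPU has been initialized.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```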
The problem with some older versions was that they do not support RTX 30xx series cards at all.
Note
On the CPU all of the models above also work fine, but with larger models, training on the CPU takes too long.
So, if anyone knows why LSTM cells work totally fine (even bidirectional ones) while GRU cells are problematic, please let me know. Thank you very much.
EDIT 1
Whole code
I use this class to work with models:
from tensorflow.keras.models import load_model  # needed by MyModel.load_model

class MyModel():
    def __init__(self, model):
        self.model = model

    def load_model(self, dir_name):
        self.model = load_model(dir_name)

    def eval_mod(self, window, verbose):
        res = self.model.evaluate(window, verbose=verbose)
        print("Loss:", res[0], "MAE:", res[1])

    def save_model(self, name):
        self.model.save("models\\models_05_04_2021\\" + name + ".model")

    def retrain_model(self, window, patience, epochs):
        early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience,
                                                          mode='min', restore_best_weights=True)
        history = self.model.fit(window.train, epochs=epochs, validation_data=window.val,
                                 callbacks=[early_stopping])
        self.history = history

    def compile_and_fit(self, window, patience=3, epochs=120):
        ## callbacks list https://www.tensorflow.org/api_docs/python/tf/keras/callbacks
        early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                          patience=patience,
                                                          mode='min', restore_best_weights=True)
        # https://github.com/Jaewan-Yun/optimizer-visualization
        # https://www.tensorflow.org/api_docs/python/tf/keras/Model
        opt = tf.optimizers.Adam()
        self.model.compile(loss=tf.losses.MeanSquaredError(),
                           optimizer=opt,
                           metrics=[tf.metrics.MeanAbsoluteError()])
        history = self.model.fit(window.train, epochs=epochs,
                                 validation_data=window.val,
                                 callbacks=[early_stopping])
        return history
Then the training:
act = "tanh"
recurrent_act = "sigmoid"
lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(512, return_sequences=True, activation=act, recurrent_activation=recurrent_act),
    tf.keras.layers.Dense(32, activation=act),
    tf.keras.layers.Dense(units=1)
])
lstm = MyModel(lstm_model)
history = lstm.compile_and_fit(wide_window_d)
The window is created using the WindowGenerator class from https://www.tensorflow.org/tutorials/structured_data/time_series
wide_window_my = WindowGenerator(
    input_width=24, label_width=24, shift=FUTURE_PERIOD_PREDICT,
    train_df=train_df_my, val_df=val_df_my, test_df=test_df_my,
    label_columns=['to_predict'])
Solution 1
Updating TensorFlow to a version above 2.5.0 solved the issue.
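A minimal sketch of guarding against the broken version at runtime, assuming a plain `x.y.z` version string (the `version_at_least` helper is illustrative, not a TensorFlow API):

```python
# Illustrative helper: parse an "x.y.z" version string and compare it
# component-wise against a minimum version tuple.
def version_at_least(version, minimum=(2, 5, 0)):
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= minimum

print(version_at_least("2.4.1"))  # -> False (the version from the question)
print(version_at_least("2.6.0"))  # -> True
```

In practice one would call `version_at_least(tf.__version__)` before training GRU models on the GPU.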
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
