"Kernel Restarting: The kernel appears to have died. It will restart automatically" (Jupyter, TensorFlow GPU)
I am facing the same issue when I try to train a CNN model. I use our organization's supercomputer for training.
The model trains fine for a smaller number of epochs. For example, with epochs=50 the run completes and writes the trained model file (via a callback passed to model.fit). With epochs=75 on the same dataset, the Jupyter notebook reports that the kernel disconnected at the 54th epoch.
I also ran the code as a .py file, submitted as a job through a Slurm script. This failed with 'segmentation fault (core dumped)' at the 54th epoch, the same point where the Jupyter kernel died. What could be the reason?
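A crash like this can be narrowed down before guessing at the cause. The sketch below is not part of the original script; it assumes a Linux host and uses only Python's standard library. It enables `faulthandler` so that a segmentation fault prints the Python traceback, and it defines a hypothetical `log_memory` helper that prints the peak resident memory of the process, which could be called at the end of every epoch.

```python
import faulthandler
import resource
import sys

# On SIGSEGV, dump the Python traceback of every thread to the process's
# original stderr (in a Slurm job this normally ends up in the job's error file).
faulthandler.enable(file=sys.__stderr__, all_threads=True)


def peak_rss_mib():
    """Peak resident set size of this process in MiB.

    Assumes Linux, where ru_maxrss is reported in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_memory(epoch, logs=None):
    """Hypothetical per-epoch hook: print the peak memory seen so far."""
    print(f"epoch {epoch + 1}: peak RSS {peak_rss_mib():.0f} MiB", flush=True)


# Stand-alone check of the helper; in training it would be called once per epoch.
log_memory(0)
```

With the Keras setup used in the script, `log_memory` could be attached through `keras.callbacks.LambdaCallback(on_epoch_end=log_memory)` and added to the callbacks list passed to `model.fit`. A peak that keeps climbing until the 54th epoch would suggest host memory exhaustion; a flat peak would point elsewhere, for example at the GPU libraries.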
The dataset size is 80,000 × 5600. The code is as follows:
import os
import os.path
import sys
import h5py
import numpy as np
import tensorflow as tf
from tensorflow import keras
def load_file(database_file, load_data=False):
    # Open the HDF5 database read-only and copy the arrays into memory.
    in_file = h5py.File(database_file, "r")
    X_train = np.array(in_file['training/data'], dtype=np.int8)
    Y_train = np.array(in_file['training/label'])
    X_test = np.array(in_file['testing/data'], dtype=np.int8)
    Y_test = np.array(in_file['testing/label'])
    if not load_data:
        return (X_train, Y_train), (X_test, Y_test)
    else:
        return (X_train, Y_train), (X_test, Y_test), (in_file['train/data'], in_file['test/data'])
database = "/home/..../dataset.h5"
trained_model = "/home..../trained_epoch_test.h5"
(X_train, Y_train), (X_test, Y_test) = load_file(database)
X_train_scaled = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test_scaled = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
y_train_categorical = keras.utils.to_categorical(Y_train, num_classes=256)
y_test_categorical = keras.utils.to_categorical(Y_test, num_classes=256)
batch_size = 800
epochs = 1
classes=256
input_shape = (10000,1)
img_input = keras.layers.Input(shape=input_shape)
x = keras.layers.Conv1D(64, 11, activation='relu', padding='same', name='block1_conv1')(img_input)
x = keras.layers.AveragePooling1D(2, strides=2, name='block1_pool')(x)
x = keras.layers.Conv1D(128, 11, activation='relu', padding='same', name='block2_conv1')(x)
x = keras.layers.AveragePooling1D(2, strides=2, name='block2_pool')(x)
x = keras.layers.Conv1D(256, 11, activation='relu', padding='same', name='block3_conv1')(x)
x = keras.layers.AveragePooling1D(2, strides=2, name='block3_pool')(x)
x = keras.layers.Conv1D(512, 11, activation='relu', padding='same', name='block4_conv1')(x)
x = keras.layers.AveragePooling1D(2, strides=2, name='block4_pool')(x)
x = keras.layers.Conv1D(512, 11, activation='relu', padding='same', name='block5_conv1')(x)
x = keras.layers.AveragePooling1D(2, strides=2, name='block5_pool')(x)
x = keras.layers.Flatten(name='flatten')(x)
x = keras.layers.Dense(4096, activation='relu', name='fc1')(x)
x = keras.layers.Dense(4096, activation='relu', name='fc2')(x)
x = keras.layers.Dense(classes, activation='softmax', name='predictions')(x)
model = keras.models.Model(img_input, x, name='cnn_best')
optimizer = keras.optimizers.RMSprop(lr=0.00001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
save_model = keras.callbacks.ModelCheckpoint(trained_model)
callbacks=[save_model]
model.fit(x=X_train_scaled, y=y_train_categorical, batch_size=batch_size, verbose = 1, epochs=epochs, callbacks=callbacks)
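For scale, the sizes involved can be estimated from the numbers above. The following is a back-of-the-envelope sketch, not a diagnosis: the helper names are hypothetical, it assumes the 80,000 × 5600 figure refers to the training array, and it evaluates the flattened length for both the 5600 samples stated for the dataset and the 10000 declared in `input_shape`.

```python
def pooled_length(length, n_pools, pool=2, stride=2):
    """Sequence length after n_pools pooling layers with 'valid' padding
    (the AveragePooling1D default); 'same'-padded convolutions keep the length."""
    for _ in range(n_pools):
        length = (length - pool) // stride + 1
    return length


def fc1_weights(seq_len, channels=512, units=4096, n_pools=5):
    """Parameter count of the first Dense layer: kernel plus bias."""
    flat = pooled_length(seq_len, n_pools) * channels
    return flat * units + units


BYTES_F32 = 4  # Keras stores weights as float32 by default

for seq_len in (5600, 10000):
    n = fc1_weights(seq_len)
    print(f"length {seq_len}: fc1 has {n:,} parameters, "
          f"{n * BYTES_F32 / 1e9:.2f} GB as float32")

# Host arrays, assuming 80,000 training traces of 5600 samples each.
x_train_bytes = 80_000 * 5600 * 1   # int8 traces
y_train_bytes = 80_000 * 256 * 4    # to_categorical returns float32 by default
print(f"X_train: {x_train_bytes / 1e6:.0f} MB, one-hot labels: {y_train_bytes / 1e6:.0f} MB")
```

Under these assumptions `fc1` alone holds roughly 367 million parameters for length 5600 (about 1.5 GB as float32) and roughly 654 million for length 10000 (about 2.6 GB), before counting optimizer state. `ModelCheckpoint`, as configured here, writes the whole model after every epoch, so the checkpoint file is of a similar order.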
OS: Linux
Python version: 3.9.7
conda list:
packages in environment at /home/.../test:
preferred channel: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit 10.2.89 684.g752c550
cudnn 7.6.5_10.2 650.g338a052
h5py 2.8.0 py37h8d01980_0
hdf5 1.10.2 hba1933b_1
ipython 7.29.0 py37he95b402_0
jupyter_client 7.1.2 pyhd3eb1b0_0
jupyter_core 4.9.1 py37h6ffa863_0
jupyterlab_pygments 0.1.2 py_0
keras 2.3.1 690.gf2fc3f6
keras-applications 1.0.8 py_1
keras-base 2.3.1 py37_690.gf2fc3f6
keras-gpu 2.3.1 690.gf2fc3f6
keras-preprocessing 1.1.0 py_1
matplotlib 3.4.3 py37h6ffa863_0
matplotlib-base 3.4.3 py37he087750_0
matplotlib-inline 0.1.2 pyhd3eb1b0_2
python 3.7.11 h836d2c2_0
python-dateutil 2.8.2 pyhd3eb1b0_0
pyyaml 5.4.1 py37h140841e_1
pyzmq 22.3.0 py37h29c3540_2
readline 8.1.2 h140841e_1
requests 2.22.0 py37_1
scipy 1.3.1 py37he2b7bc3_0
send2trash 1.8.0 pyhd3eb1b0_1
setuptools 58.0.4 py37h6ffa863_0
six 1.13.0 py37_0
sqlite 3.37.0 hd7247d8_0
tensorboard 2.1.1 py37_66d10d7_4000.g7f90012
tensorflow 2.1.3 gpu_py37_945.g7f90012
tensorflow-base 2.1.3 gpu_py37_77f47d6_72821.g5e36fbc
tensorflow-estimator 2.1.0 py37_7ec4e5d_1493.g7f90012
tensorflow-gpu 2.1.3 945.g7f90012
tensorrt 7.0.0.11 py37_698.g9922bde
termcolor 1.1.0 py37h6ffa863_1
terminado 0.9.4 py37h6ffa863_0
testpath 0.5.0 pyhd3eb1b0_0
tk 8.6.11 h7e00dab_0
tornado 6.1 py37h140841e_0
traitlets 5.1.1 pyhd3eb1b0_0
typing_extensions 3.10.0.2 pyh06a4308_0
uff 0.6.5 py37_698.g9922bde
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
