Mask-RCNN training gets stuck at the beginning
My model seems to be stuck at the beginning of the training process when using the forked Mask-RCNN-TF2.7.0-keras2.8.0. There is no error message, so I cannot tell why. Many people have encountered this problem with the official Mask_RCNN (see #2168, #1899, #1962).
They have mainly provided two solutions:
- set `workers = 0` and `use_multiprocessing = False`, which I found is already done in the source file `model.py`;
- update the GPU driver. My current GPU driver version is 512.15, which I suppose is quite new.
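For background on the first suggestion: here is a small, self-contained illustration (not from the original post) of why multiprocessing-based workers are fragile on Windows. Windows uses the `spawn` start method, which must pickle whatever feeds the worker processes, and plain Python generator objects cannot be pickled:

```python
import pickle

# A plain Python generator, standing in for a training data generator.
def batches():
    while True:
        yield "batch"

gen = batches()
try:
    pickle.dumps(gen)
    print("pickled OK")
except TypeError as exc:
    # Generators cannot be pickled, so spawn-based workers cannot receive them.
    print("generator is not picklable:", exc)
```

This is why `workers=0` with `use_multiprocessing=False` (running the generator in the main process) is the usual workaround on Windows.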
Some people say that the code works fine on Ubuntu but not on Windows 10. That matches my experience, but I haven't found a solution yet. By the way, my training dataset contains only 20 images, so it should not be hard to train.
The code where I run into the problem is as follows; it is basically the same as the sample train_shapes.ipynb provided by the Matterport team.
```python
#### Create Mask_RCNN model ####
model = modellib.MaskRCNN(mode='training', config=config, model_dir=MODEL_DIR)

# Which weights to start with?
init_with = 'coco'  # coco, or last
ly_names = ['mrcnn_class_logits', 'mrcnn_bbox_fc',
            'mrcnn_bbox', 'mrcnn_mask']
# ly_decodes = [unicode(s) for s in ly_names]
if init_with == 'coco':
    # %debug
    model.load_weights(COCO_MODEL_PATH, by_name=True,
                       exclude=ly_names)
elif init_with == 'last':
    model.load_weights(model.find_last(), by_name=True)

#### Train the model layers ####
# Train the head layers
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=1,  # epochs=60 in the original version
            layers='heads',
            augmentation=augmentation)

# Fine-tune all layers
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=1,  # epochs=140 in the original version
            layers='all',
            augmentation=augmentation)

#### Create a config instance used for inference ####
class InferenceConfig(Config):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    IMAGE_MAX_DIM = 1024
    IMAGE_MIN_DIM = 1024
    IMAGE_RESIZE_MODE = 'square'

inference_config = InferenceConfig()
batch_size = inference_config.BATCH_SIZE

#### Load the last training weights for inference ####
model = modellib.MaskRCNN(mode='inference',
                          config=inference_config,
                          model_dir=MODEL_DIR)
model.load_weights(model.find_last(), by_name=True)
```
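As an aside, `inference_config.BATCH_SIZE` is never set explicitly above because Matterport's `Config.__init__` derives it from the other settings. A minimal stand-in (a hypothetical re-implementation for illustration, not imported from the library) shows the computation:

```python
# Minimal stand-in for mrcnn.config.Config's batch-size computation;
# a sketch assuming the Matterport convention, not the actual library code.
class Config:
    GPU_COUNT = 1
    IMAGES_PER_GPU = 2

    def __init__(self):
        # Effective batch size is images per GPU times number of GPUs.
        self.BATCH_SIZE = self.IMAGES_PER_GPU * self.GPU_COUNT

class InferenceConfig(Config):
    # One image at a time for inference.
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

print(InferenceConfig().BATCH_SIZE)  # → 1
```

So with `GPU_COUNT = 1` and `IMAGES_PER_GPU = 1`, inference runs on one image at a time.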
When I run the above code, all I get is this:

```
Starting at epoch 0. LR=0.001
Checkpoint Path: C:\Mask_RCNN2.0\logs\trout20220520T1216\mask_rcnn_trout_{epoch:04d}.h5
Selecting layers to train
fpn_c5p5               (Conv2D)
fpn_c4p4               (Conv2D)
fpn_c3p3               (Conv2D)
fpn_c2p2               (Conv2D)
fpn_p5                 (Conv2D)
fpn_p2                 (Conv2D)
fpn_p3                 (Conv2D)
fpn_p4                 (Conv2D)
rpn_model              (Functional)
mrcnn_mask_conv1       (TimeDistributed)
mrcnn_mask_bn1         (TimeDistributed)
mrcnn_mask_conv2       (TimeDistributed)
mrcnn_mask_bn2         (TimeDistributed)
mrcnn_class_conv1      (TimeDistributed)
mrcnn_class_bn1        (TimeDistributed)
mrcnn_mask_conv3       (TimeDistributed)
mrcnn_mask_bn3         (TimeDistributed)
mrcnn_class_conv2      (TimeDistributed)
mrcnn_class_bn2        (TimeDistributed)
mrcnn_mask_conv4       (TimeDistributed)
mrcnn_mask_bn4         (TimeDistributed)
mrcnn_bbox_fc          (TimeDistributed)
mrcnn_mask_deconv      (TimeDistributed)
mrcnn_class_logits     (TimeDistributed)
mrcnn_mask             (TimeDistributed)
C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\optimizer_v2\gradient_descent.py:102: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(SGD, self).__init__(name, **kwargs)
```
My environment setup is:
- GPU: NVIDIA Quadro P620, 2 GB (maybe it is too small for Mask_RCNN training?)
- CUDA driver API: 11.6
- CUDA runtime API: 11.2
- cuDNN: 8.1
- TensorFlow: 2.7.0
- Keras: 2.7.0 (changed due to the previous problem)
- OS: Windows 10
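One thing worth trying given the 2 GB card (an assumption on my part, not something from the original post): by default TensorFlow reserves essentially all GPU memory up front, which on a small card can lead to silent stalls or out-of-memory failures. Enabling memory growth is a common mitigation:

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of grabbing it
# all at start-up; this must run before any GPU op executes.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
print(f"{len(gpus)} GPU(s) visible")
```

If the hang persists even with memory growth enabled, the GPU memory is probably not the cause.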
Is there any version incompatibility in my setup? Or some other problem? Many thanks if you can help me with it.
Best,
Erin
Here is the traceback I get when I interrupt the process during CPU training (it hangs at Epoch 1/10):
```
File C:\Mask_RCNN2.0\trout\FishSeg_0520.py:325 in <module>
  model.train(dataset_train, dataset_val,
File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\mrcnn\model.py:2364 in train
  self.keras_model.fit(
File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\engine\training_v1.py:777 in fit
  return func.fit(
File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\engine\training_generator_v1.py:570 in fit
  return fit_generator(
File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\engine\training_generator_v1.py:212 in model_iteration
  batch_data = _get_next_batch(generator)
File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\engine\training_generator_v1.py:346 in _get_next_batch
  generator_output = next(generator)
File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\utils\data_utils.py:499 in iter_sequence_infinite
  for item in seq:
File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\utils\data_utils.py:485 in __iter__
  for item in (self[i] for i in range(len(self))):
```
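The traceback shows the hang inside `_get_next_batch(generator)`, i.e. while pulling a batch from the data generator rather than inside the model itself. One way to confirm that, independent of Keras, is to pull a batch through a watchdog thread. This is a stdlib-only helper I wrote for illustration, not part of mrcnn:

```python
import queue
import threading

def next_with_timeout(gen, timeout=30.0):
    """Fetch the next item from `gen` in a worker thread.

    Raises TimeoutError if nothing is produced in time, which is
    exactly the symptom of a stuck data pipeline."""
    q = queue.Queue()

    def worker():
        try:
            q.put(("ok", next(gen)))
        except Exception as exc:  # surface generator errors to the caller
            q.put(("err", exc))

    threading.Thread(target=worker, daemon=True).start()
    try:
        status, value = q.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(f"no batch produced within {timeout} s")
    if status == "err":
        raise value
    return value

# Against Mask R-CNN it would be used roughly like this (Matterport's
# model.py exposes a data_generator, though the signature may differ
# between forks):
# gen = modellib.data_generator(dataset_train, config, shuffle=True,
#                               batch_size=config.BATCH_SIZE)
# batch = next_with_timeout(gen, timeout=60)
```

If this times out too, the problem is in the dataset loading code (e.g. `load_image`/`load_mask`) rather than in TensorFlow or the GPU setup.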
