Mask-RCNN training gets stuck at the beginning

My model gets stuck at the very beginning of training when using the forked Mask-RCNN-TF2.7.0-keras2.8.0. No error is raised, so I cannot tell why. Many people have run into this problem with the official Mask_RCNN (see #2168, #1899, #1962).

Two main solutions have been suggested:

  1. Set workers = 0 and use_multiprocessing = False, which I found you have already done in the source file model.py;
  2. Update the GPU driver. My current GPU driver version is 512.15, which I believe is quite recent.
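Since there is no error message, the first thing worth ruling out is a hung data generator. This is a generic diagnostic sketch (not part of the Mask_RCNN API) that pulls one batch in a worker thread and raises if nothing arrives within a timeout, so a slow pipeline can be told apart from a genuinely stuck one:

```python
import queue
import threading

def next_batch_with_timeout(generator, timeout=30):
    """Pull one batch from `generator` in a worker thread.

    Returns the batch, or raises TimeoutError if the generator produces
    nothing within `timeout` seconds -- useful for distinguishing a slow
    data pipeline from a hung one.
    """
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(("ok", next(generator)))
        except Exception as exc:  # propagate generator errors to the caller
            result.put(("err", exc))

    threading.Thread(target=worker, daemon=True).start()
    try:
        status, payload = result.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(f"generator produced no batch within {timeout}s")
    if status == "err":
        raise payload
    return payload

# Toy generator standing in for the real Mask R-CNN data generator:
def toy_batches():
    while True:
        yield {"images": [0] * 4}

batch = next_batch_with_timeout(toy_batches(), timeout=5)
print(type(batch))
```

If this wrapper times out on the real generator, the hang is in data loading rather than in the model graph itself.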

Some people report that the code works fine on Ubuntu but not on Windows 10. That matches my experience, but I haven't found a solution yet. By the way, my training dataset contains only 20 images, so training should not be that demanding.
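One likely source of the Ubuntu/Windows difference: Python multiprocessing defaults to "fork" on Linux, while Windows only supports "spawn", which re-imports the main script in every worker process. Unguarded top-level training code can therefore re-execute or deadlock on Windows. A minimal check:

```python
import multiprocessing as mp

# Linux defaults to "fork" (workers inherit the parent's state);
# Windows only has "spawn" (workers re-import the main module).
print(mp.get_start_method())  # "fork" on Linux, "spawn" on Windows

if __name__ == "__main__":
    # On Windows, training code should live under this guard so that
    # spawned workers do not re-run it on import.
    pass
```

This is also why `workers=0, use_multiprocessing=False` (solution 1 above) is the usual Windows workaround: it keeps data loading in the main process entirely.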

The code where I run into the problem is as follows; it is basically the same as the sample train_shapes.ipynb provided by the Matterport team.

#### create Mask_RCNN model ####
model = modellib.MaskRCNN(mode='training', config=config, model_dir=MODEL_DIR) # Define the model

# Which weights to start with?
init_with = 'coco'  # coco, or last
ly_names = ['mrcnn_class_logits', 'mrcnn_bbox_fc',
            'mrcnn_bbox', 'mrcnn_mask']

if init_with == 'coco':
    # Exclude the head layers so they are re-initialized for the new classes
    model.load_weights(COCO_MODEL_PATH, by_name=True,
                       exclude=ly_names)
elif init_with == 'last':
    # Resume from the most recently saved weights
    model.load_weights(model.find_last(), by_name=True)

#### Train the model layers ####
# Train the head layers
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE, 
            epochs=1,  # epochs=60 in the original version
            layers='heads',
            augmentation=augmentation)

# Fine-tune all layers
model.train(dataset_train, dataset_val, 
            learning_rate=config.LEARNING_RATE / 10,
            epochs=1,  # epochs=140 in the original version
            layers='all',
            augmentation=augmentation)

#### create a config instance used for inference ####
class InferenceConfig(Config):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    
    IMAGE_MAX_DIM = 1024
    IMAGE_MIN_DIM = 1024
    
    IMAGE_RESIZE_MODE = 'square'

inference_config = InferenceConfig()
batch_size = inference_config.BATCH_SIZE

#### load the last training weight for inference ####
model = modellib.MaskRCNN(mode="inference", 
                          config=inference_config,
                          model_dir=MODEL_DIR)

model.load_weights(model.find_last(), by_name=True)
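With only 20 training images, one dataset-level cause of a silent hang is worth checking: mrcnn's data generator skips images with no instances, so if many entries yield empty masks it can loop indefinitely looking for valid ROIs. A hedged pre-flight check, assuming the dataset exposes the usual mrcnn utils.Dataset interface (image_ids, load_image, load_mask); the ToyDataset below is only a stand-in for illustration:

```python
import numpy as np

def sanity_check_dataset(dataset, limit=20):
    """Load every image and mask once before calling model.train().

    Returns the ids of images whose masks contain no instances -- these
    are candidates for making mrcnn's data_generator spin forever.
    Assumes the mrcnn utils.Dataset interface.
    """
    bad = []
    for image_id in list(dataset.image_ids)[:limit]:
        dataset.load_image(image_id)              # should not raise
        masks, class_ids = dataset.load_mask(image_id)
        if masks.size == 0 or not masks.any():    # no instances at all
            bad.append(image_id)
    return bad

# Toy stand-in with one valid entry and one empty-mask entry:
class ToyDataset:
    image_ids = [0, 1]
    def load_image(self, i):
        return np.zeros((8, 8, 3), dtype=np.uint8)
    def load_mask(self, i):
        m = np.zeros((8, 8, 1), dtype=bool)
        if i == 0:
            m[2:4, 2:4, 0] = True
        return m, np.array([1])

print(sanity_check_dataset(ToyDataset()))  # -> [1]
```

Running such a loop over dataset_train before model.train() confirms that the annotations themselves load cleanly.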

When I run the code above, this is all the output I get:

Starting at epoch 0. LR=0.001

Checkpoint Path: C:\Mask_RCNN2.0\logs\trout20220520T1216\mask_rcnn_trout_{epoch:04d}.h5
Selecting layers to train
fpn_c5p5               (Conv2D)
fpn_c4p4               (Conv2D)
fpn_c3p3               (Conv2D)
fpn_c2p2               (Conv2D)
fpn_p5                 (Conv2D)
fpn_p2                 (Conv2D)
fpn_p3                 (Conv2D)
fpn_p4                 (Conv2D)
rpn_model              (Functional)
mrcnn_mask_conv1       (TimeDistributed)
mrcnn_mask_bn1         (TimeDistributed)
mrcnn_mask_conv2       (TimeDistributed)
mrcnn_mask_bn2         (TimeDistributed)
mrcnn_class_conv1      (TimeDistributed)
mrcnn_class_bn1        (TimeDistributed)
mrcnn_mask_conv3       (TimeDistributed)
mrcnn_mask_bn3         (TimeDistributed)
mrcnn_class_conv2      (TimeDistributed)
mrcnn_class_bn2        (TimeDistributed)
mrcnn_mask_conv4       (TimeDistributed)
mrcnn_mask_bn4         (TimeDistributed)
mrcnn_bbox_fc          (TimeDistributed)
mrcnn_mask_deconv      (TimeDistributed)
mrcnn_class_logits     (TimeDistributed)
mrcnn_mask             (TimeDistributed)
C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\optimizer_v2\gradient_descent.py:102: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(SGD, self).__init__(name, **kwargs)

My environment setup is:

  • GPU: NVIDIA Quadro P620, 2 GB (perhaps too small for Mask_RCNN training?)
  • CUDA Driver API: 11.6
  • CUDA Runtime API: 11.2
  • cuDNN: 8.1
  • tensorflow 2.7.0
  • Keras 2.7.0 (changed due to the previous problem)
  • OS: Windows 10

Is there any version incompatibility in my setup, or some other problem? Many thanks if you can help me with it.

Best,

Erin

Here is the traceback I get when I interrupt the process during CPU training (it hangs at Epoch 1/10):

 File C:\Mask_RCNN2.0\trout\FishSeg_0520.py:325 in <module>
    model.train(dataset_train, dataset_val,

  File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\mrcnn\model.py:2364 in train
    self.keras_model.fit(

  File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\engine\training_v1.py:777 in fit
    return func.fit(

  File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\engine\training_generator_v1.py:570 in fit
    return fit_generator(

  File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\engine\training_generator_v1.py:212 in model_iteration
    batch_data = _get_next_batch(generator)

  File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\engine\training_generator_v1.py:346 in _get_next_batch
    generator_output = next(generator)

  File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\utils\data_utils.py:499 in iter_sequence_infinite
    for item in seq:

  File C:\Apps\Anaconda3\envs\CNNTest2\lib\site-packages\keras\utils\data_utils.py:485 in __iter__
    for item in (self[i] for i in range(len(self))):

