Running out of memory on Google Colab

I'm trying to run a demo of TF Object Detection model with Faster RCNN on Google Colab Pro GPU (RAM: 25GB, Disk: 147GB), but it fails and gives me the following error:

Tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (GPU_0_bfc) ran out of memory trying to allocate 7.18GiB (rounded to 7707033600)requested by op MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/AvgPool-0-TransposeNHWCToNCHW-LayoutOptimizer
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 

Then it gives me these stats:

I tensorflow/core/common_runtime/bfc_allocator.cc:1058] Sum Total of in-use chunks: 7.46GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:1060] total_region_allocated_bytes_: 15034482688 memory_limit_: 16183459840 available bytes: 1148977152 curr_region_allocation_bytes_: 8589934592
I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Stats: 
Limit:                     16183459840
InUse:                      8013051904
MaxInUse:                   8081602560
NumAllocs:                        6801
MaxAllocSize:               7707033600
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

And

tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[2400,1024,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/AvgPool-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference__dummy_computation_fn_32982]

I don't understand why it runs out of memory when allocating only about 7 GB on a system with 25 GB. How can I fix it? Here is my config file for this task:

# Faster R-CNN with Resnet-50 (v1)
# Trained on COCO, initialized from Imagenet classification checkpoint

# Achieves -- mAP on COCO14 minival dataset.

# This config is TPU compatible.

model {
  faster_rcnn {
    num_classes: 7
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 640
        max_dimension: 640
        pad_to_max_dimension: true
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet50_keras'
      batch_norm_trainable: true
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
        share_box_across_classes: true
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    use_static_shapes: true
    use_matmul_crop_and_resize: true
    clip_anchors_to_image: true
    use_static_balanced_label_sampler: true
    use_matmul_gather_in_matcher: true
  }
}

train_config: {
  batch_size: 8
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 25000
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .04
          total_steps: 25000
          warmup_learning_rate: .013333
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "faster_rcnn_resnet50_v1_640x640_coco17_tpu-8/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "detection"
  data_augmentation_options {
    random_horizontal_flip {
    }
  }

  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  use_bfloat16: true  # works only on TPUs
}

train_input_reader: {
  label_map_path: "label_map.pbtxt"
  tf_record_input_reader {
    input_path: "train.record"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
}

eval_input_reader: {
  label_map_path: "label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "test.record"
  }
}



Solution 1:[1]

I ran into the same problem, and it took me a week to figure out. I am using Colab Pro+ and was allocated a Tesla P100 GPU (16 GB). My images have shape (256, 256, 4) and the batch size is 32. When we design an architecture, we usually don't think about the number of parameters until we run into something like a ResourceExhaustedError. Then we make changes to reduce the parameter count. But another factor also takes up memory: in my case, I use four separate variables to hold the tensors.

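 from tensorflow.keras import layers
 import tensorflow_addons as tfa

 # x is the output tensor of the previous block
 block_0=layers.Conv2D(filters=32, kernel_size=(1,1),strides=(1, 1), activation="LeakyReLU",padding='same',name='block_5_layer_1')(x)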
 block_1=tfa.layers.SpectralNormalization(layers.Conv2D(filters=64, kernel_size=(3,3),strides=(1,1), activation="LeakyReLU",padding='same',name='block_5_layer_2'))(x)
 block_1=tfa.layers.SpectralNormalization(layers.Conv2D(filters=64, kernel_size=(5,5),strides=(1, 1),padding='same',name='block_5_layer_3'))(block_1)
 # block_1=layers.BatchNormalization()(block_1)
 block_1=layers.Activation('LeakyReLU')(block_1)
 block_1=layers.Dropout(0.7)(block_1)

 block_2=tfa.layers.SpectralNormalization(layers.Conv2D(filters=32, kernel_size=(1,1),strides=(1, 1), activation="LeakyReLU",padding='same',name='block_5_layer_4'))(x)
 block_2=tfa.layers.SpectralNormalization(layers.Conv2D(filters=96, kernel_size=(3,3),strides=(1, 1), activation="LeakyReLU",padding='same',name='block_5_layer_5'))(block_2)
 block_2=layers.Dropout(0.7)(block_2)

 block_3=layers.MaxPool2D(pool_size=(3,3),strides=(1,1 ),padding='same',name='block_5_maxpool_1') (x)
 block_3=tfa.layers.SpectralNormalization(layers.Conv2D(filters=64, kernel_size=(3,3),strides=(1, 1), activation="LeakyReLU",padding='same',name='block_5_layer_6'))(block_3)


 x = layers.concatenate(
  [block_0, block_1, block_2, block_3],
  axis=3,
  name='block_5')

For me, the ResourceExhaustedError occurred at the very first layer. The resulting shape is (32,256,256,352), which is huge, and I suppose those tensors are stored on the GPU itself. This takes up a lot of space, which is why TensorFlow can't allocate memory for the layers. When I reduced the dimensions, it worked. So I think we should consider the shapes of the tensors holding the convolved images as well as the parameter count. Correct me if I am wrong.
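
A rough way to check this before training is to compute the size of a single activation tensor from its shape. The following is a minimal sketch, assuming float32 activations (4 bytes per element); `activation_gib` is a hypothetical helper written for this illustration, not a TensorFlow function.

```python
import math

def activation_gib(shape, bytes_per_element=4):
    """Return the size in GiB of one dense tensor with the given shape."""
    return math.prod(shape) * bytes_per_element / 2**30

# Shape reported above: batch 32, 256x256 spatial, 352 channels (float32 assumed).
size = activation_gib((32, 256, 256, 352))
print(f"{size:.2f} GiB")  # prints "2.75 GiB"
```

During training, activations from earlier layers are kept for the backward pass, so several tensors of this size can be alive on a 16 GB GPU at once. Halving the batch size or the channel count halves this figure.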

Solution 2:[2]

I realized the problem was that the images in each batch were taking up too much memory, as described in https://github.com/tensorflow/models/issues/1817, so I changed my batch size to 2 and it worked.
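
The arithmetic behind this can be sketched as follows. The failing tensor in the error message has shape [2400,1024,28,28], and 2400 equals batch_size times first_stage_max_proposals (8 x 300) from the config, which is presumably where that dimension comes from. Assuming that relationship and float32 elements, the hypothetical helper `roi_tensor_bytes` below estimates the size of that one tensor for different batch sizes.

```python
GPU_LIMIT_BYTES = 16183459840  # "Limit" from the allocator stats in the question

def roi_tensor_bytes(batch_size, max_proposals=300, channels=1024, crop=28,
                     bytes_per_element=4):
    """Size in bytes of a [batch*proposals, channels, crop, crop] tensor.

    Assumption: the first dimension is batch_size * first_stage_max_proposals.
    """
    return batch_size * max_proposals * channels * crop * crop * bytes_per_element

for batch_size in (8, 2):
    n = roi_tensor_bytes(batch_size)
    print(batch_size, n, f"{n / 2**30:.2f} GiB",
          f"{n / GPU_LIMIT_BYTES:.0%} of GPU limit")
```

For batch size 8 this gives 7707033600 bytes, the same number as in the allocator message. Note that the 25 GB in the question is system RAM, while the allocator works against the GPU limit of 16183459840 bytes, of which about 8 GB was already in use.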

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dcode
Solution 2 Nha Binh Chang