Running out of memory on Google Colab
I'm trying to run a demo of the TF Object Detection API's Faster R-CNN model on a Google Colab Pro GPU runtime (RAM: 25GB, Disk: 147GB), but it fails and gives me the following error:
Tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (GPU_0_bfc) ran out of memory trying to allocate 7.18GiB (rounded to 7707033600) requested by op MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/AvgPool-0-TransposeNHWCToNCHW-LayoutOptimizer
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Then it gives me these stats:
I tensorflow/core/common_runtime/bfc_allocator.cc:1058] Sum Total of in-use chunks: 7.46GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:1060] total_region_allocated_bytes_: 15034482688 memory_limit_: 16183459840 available bytes: 1148977152 curr_region_allocation_bytes_: 8589934592
I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Stats:
Limit: 16183459840
InUse: 8013051904
MaxInUse: 8081602560
NumAllocs: 6801
MaxAllocSize: 7707033600
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
And
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2400,1024,28,28] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/AvgPool-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference__dummy_computation_fn_32982]
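As far as I understand, the TF_GPU_ALLOCATOR hint from the message above only takes effect if the variable is set before TensorFlow initializes the GPU, so in Colab that would be something like:
import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'  # must be set before TensorFlow touches the GPU
import tensorflow as tf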
I don't really understand why it runs out of memory when allocating only ~7GB on a 25GB system. How can I fix it? Here is my config file for this task:
# Faster R-CNN with Resnet-50 (v1)
# Trained on COCO, initialized from Imagenet classification checkpoint
# Achieves -- mAP on COCO14 minival dataset.
# This config is TPU compatible.
model {
  faster_rcnn {
    num_classes: 7
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 640
        max_dimension: 640
        pad_to_max_dimension: true
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet50_keras'
      batch_norm_trainable: true
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
        share_box_across_classes: true
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    use_static_shapes: true
    use_matmul_crop_and_resize: true
    clip_anchors_to_image: true
    use_static_balanced_label_sampler: true
    use_matmul_gather_in_matcher: true
  }
}
train_config: {
  batch_size: 8
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 25000
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .04
          total_steps: 25000
          warmup_learning_rate: .013333
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "faster_rcnn_resnet50_v1_640x640_coco17_tpu-8/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "detection"
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  use_bfloat16: true # works only on TPUs
}
train_input_reader: {
  label_map_path: "label_map.pbtxt"
  tf_record_input_reader {
    input_path: "train.record"
  }
}
eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1;
}
eval_input_reader: {
  label_map_path: "label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "test.record"
  }
}
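One thing I did notice: the failing allocation matches the tensor shape in the error exactly if I assume float32 (4 bytes per element), and the leading 2400 looks like batch_size × first_stage_max_proposals (8 × 300) from the config above:
# shape [2400, 1024, 28, 28], 4 bytes per float32 element
print(2400 * 1024 * 28 * 28 * 4)           # 7707033600 -- the "rounded to" value in the log
print(2400 * 1024 * 28 * 28 * 4 / 2**30)   # ~7.18 (GiB)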
Solution 1:[1]
I also ran into the same problem, and it took me a week to figure it out. I am using Colab Pro+, and a Tesla P100 GPU (16GB) was allocated to me. My image dims are (256, 256, 4) and my batch size is 32. When we design an architecture we usually don't think about the size of the parameters until we run into a ResourceExhaustedError; then we make changes to reduce the parameter count. But there is another factor that takes up memory: in my case I use four separate variables to hold intermediate tensors.
# Fragment of the model; `x` here is the output of the previous block,
# and `layers` / `tfa` come from tf.keras and TensorFlow Addons.
import tensorflow_addons as tfa
from tensorflow.keras import layers

# Four parallel branches (Inception-style); each keeps its own intermediate tensor alive.
block_0 = layers.Conv2D(filters=32, kernel_size=(1, 1), strides=(1, 1), activation="LeakyReLU", padding='same', name='block_5_layer_1')(x)
block_1 = tfa.layers.SpectralNormalization(layers.Conv2D(filters=64, kernel_size=(3, 3), strides=(1, 1), activation="LeakyReLU", padding='same', name='block_5_layer_2'))(x)
block_1 = tfa.layers.SpectralNormalization(layers.Conv2D(filters=64, kernel_size=(5, 5), strides=(1, 1), padding='same', name='block_5_layer_3'))(block_1)
# block_1 = layers.BatchNormalization()(block_1)
block_1 = layers.Activation('LeakyReLU')(block_1)
block_1 = layers.Dropout(0.7)(block_1)
block_2 = tfa.layers.SpectralNormalization(layers.Conv2D(filters=32, kernel_size=(1, 1), strides=(1, 1), activation="LeakyReLU", padding='same', name='block_5_layer_4'))(x)
block_2 = tfa.layers.SpectralNormalization(layers.Conv2D(filters=96, kernel_size=(3, 3), strides=(1, 1), activation="LeakyReLU", padding='same', name='block_5_layer_5'))(block_2)
block_2 = layers.Dropout(0.7)(block_2)
block_3 = layers.MaxPool2D(pool_size=(3, 3), strides=(1, 1), padding='same', name='block_5_maxpool_1')(x)
block_3 = tfa.layers.SpectralNormalization(layers.Conv2D(filters=64, kernel_size=(3, 3), strides=(1, 1), activation="LeakyReLU", padding='same', name='block_5_layer_6'))(block_3)
# Concatenate the four branches along the channel axis into one large activation tensor.
x = layers.concatenate([block_0, block_1, block_2, block_3], axis=3, name='block_5')
For me the ResourceExhaustedError occurred at the very first block. The resulting shape is (32, 256, 256, 352), which is huge, and I suppose those tensors are stored on the GPU itself, so they take up a lot of space and TensorFlow can't allocate memory for the following layers. When I reduced the dims it worked. So I think we should also consider the shapes of the variables holding the convolved feature maps, not just the parameter count. Correct me if I am wrong.
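To put rough numbers on that, here is a back-of-the-envelope sketch (a hypothetical helper, assuming float32 activations and a single forward pass; gradients and optimizer state take additional memory on top of this):
import numpy as np

def approx_activation_bytes(model, batch_size, bytes_per_elem=4):
    # Very rough estimate: sum each layer's output size in a built tf.keras model.
    total = 0
    for layer in model.layers:
        shapes = layer.output_shape
        if isinstance(shapes, tuple):  # single-output layers return one tuple
            shapes = [shapes]
        for shape in shapes:
            elems = int(np.prod([d for d in shape if d is not None]))  # skip the None batch dim
            total += elems * batch_size * bytes_per_elem
    return total

# A single (32, 256, 256, 352) float32 activation is already ~2.75 GiB:
print(32 * 256 * 256 * 352 * 4 / 2**30)  # ~2.75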
Solution 2:[2]
I realized the problem was the images taking up too much memory per batch, as per https://github.com/tensorflow/models/issues/1817, so I changed my batch size to 2 and it worked.
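In the pipeline config from the question, that corresponds to lowering batch_size in train_config; the cropped-proposal tensor that fails to allocate should shrink roughly in proportion, so going from 8 to 2 cuts the ~7.18 GiB request to about a quarter. Only the changed field is shown here:
train_config: {
  batch_size: 2  # reduced from 8; everything else stays the same
  ...
}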
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dcode |
| Solution 2 | Nha Binh Chang |