Training loss for Faster R-CNN becoming NaN or infinity

I want to train the PyTorch Faster R-CNN module on a custom dataset that I curated and labelled. The implementation looked straightforward: there is a demo that shows training and inference on a custom dataset (a person-detection problem). Just like that dataset, which has only one class (person) plus the background class, my dataset also has a single class, so I saw no need to change any hyper-parameters. However, the training loss becomes NaN or infinity during the first epoch, and I cannot tell what is causing it. Grateful for any suggestions.
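For reference, a minimal sketch of the model setup I am using, along the lines of the torchvision detection tutorial (exact dataset and transform code omitted; `num_classes=2` because the background counts as its own class):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# One custom class + background.
num_classes = 2

# Load a Faster R-CNN pre-trained on COCO and swap in a new box predictor head
# sized for the custom number of classes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```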

Edit: I found that my masks were sometimes out of the image bounds, which was blowing up the loss, so the cumulative loss was diverging to positive or negative infinity. This might not have halted training by itself, except that the train_one_epoch function provided by the PyTorch detection Colab notebook example (using the PennFudan dataset) guards against it: it calls sys.exit() whenever math.isfinite() reports that the loss is no longer finite.
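For context, the guard in the tutorial's train_one_epoch is roughly `if not math.isfinite(loss_value): sys.exit(1)`, so any non-finite loss aborts the run immediately. Below is a hedged sketch of the kind of pre-training sanity check that would have caught my problem; `check_targets` is a hypothetical helper, and it assumes the tutorial's `(image, target)` dataset format with `target["boxes"]` in `[x1, y1, x2, y2]` order:

```python
def check_targets(dataset):
    """Hypothetical sanity check: flag boxes that fall outside the image.

    Assumes each dataset item returns (image_tensor, target) in the
    torchvision tutorial's format, where image_tensor has shape (C, H, W)
    and target["boxes"] holds [x1, y1, x2, y2] coordinates in pixels.
    """
    for idx in range(len(dataset)):
        img, target = dataset[idx]
        _, height, width = img.shape
        for box in target["boxes"]:
            x1, y1, x2, y2 = box.tolist()
            # Degenerate or out-of-bounds boxes can drive the RPN/ROI losses
            # to NaN or infinity, so report them before training starts.
            if x1 < 0 or y1 < 0 or x2 > width or y2 > height or x2 <= x1 or y2 <= y1:
                print(f"sample {idx}: box {box.tolist()} invalid for {width}x{height} image")
```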


