Implementing WeightedRandomSampler on imbalanced data set: RuntimeError: invalid multinomial distribution

I am trying to implement a weighted sampler for a very imbalanced data set. There are 182 different classes. Here is an array of the bin counts per class:

array([69487,  5770,  5753,   138,  4308,    10,  1161,    29,  5611,
         350,     7,   183,   218,     4,     3,  3872,     5,   950,
          33,     3,   443,    16,    20,   330,  4353,   186,    19,
         122,   546,     6,    44,     6,  3561,  2186,     3,    48,
        8440,   338,     9,   610,    74,   236,   160,   449,    72,
           6,    37,  1729,  2255,  1392,    12,     1,  3426,   513,
          44,     3,    28,    12,     9,    27,     5,    75,    15,
           3,    21,   549,     7,    25,   871,   240,   128,    28,
         253,    62,    55,    12,     8,    57,    16,    99,     6,
           5,   150,     7,   110,     8,     2,  1296,    70,  1927,
         470,     1,     1,   511,     2,   620,   946,    36,    19,
          21,    39,     6,   101,    15,     7,     1,    90,    29,
          40,    14,     1,     4,   330,  1099,  1248,  1146,  7414,
         934,   156,    80,   755,     3,     6,     6,     9,    21,
          70,   219,     3,     3,    15,    15,    12,    69,    21,
          15,     3,   101,     9,     9,    11,     6,    32,     6,
          32,  4422, 16282, 12408,  2959,  3352,   146,  1329,  1300,
        3795,    90,  1109,   120,    48,    23,     9,     1,     6,
           2,     1,    11,     5,    27,     3,     7,     1,     3,
          70,  1598,   254,    90,    20,   120,   380,   230,   180,
          10,    10])

Some classes have as few as one instance. I am trying to use torch's WeightedRandomSampler for this dataset. However, because the class imbalance is so large, when I calculate the weights using

count_occr = np.bincount(dataset.y)
lbl_weights = 1. / count_occr
weights = np.array(lbl_weights)
weights = torch.from_numpy(weights)
sampler = WeightedRandomSampler(weights.type('torch.DoubleTensor'), len(weights*2))

I get two error messages:

RuntimeWarning: divide by zero encountered in true_divide

and

RuntimeError: invalid multinomial distribution (encountering probability entry = infinity or NaN)
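
For reference, a minimal sketch of one way these two messages can arise together. It assumes that at least one label id never occurs in dataset.y (for example after a train/test split), which the question does not confirm. The labels in `y` are made up for illustration. np.bincount reports a count of 0 for any such id, so 1. / 0 produces inf, and the sampler rejects an inf weight.

```python
import numpy as np

# Hypothetical labels: class 2 never occurs, so its bin count is 0.
y = np.array([0, 0, 0, 1, 3])
count_occr = np.bincount(y)            # one count per label id 0..3

# Suppress the "divide by zero" RuntimeWarning for this demonstration only.
with np.errstate(divide="ignore"):
    lbl_weights = 1.0 / count_occr     # the absent class gets weight inf

# True when any weight is inf, which the sampler would reject.
has_inf = bool(np.isinf(lbl_weights).any())
```

Checking `(count_occr == 0).any()` on the real data would show whether this assumption holds.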

Does anyone have a workaround for this? I was considering multiplying lbl_weights by some scalar, but I am not sure whether that is a viable option.
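
One possible workaround, sketched under assumptions rather than taken from an accepted answer: compute the inverse frequency only for classes that actually occur, then index the class weights by label so that the sampler receives one weight per sample rather than one per class. The helper names `per_sample_weights` and `make_sampler` are hypothetical, and the labels in `y` are made up. Multiplying by a scalar would probably not help, because inf times a positive scalar is still inf and the sampler only uses the weights relative to each other.

```python
import numpy as np

def per_sample_weights(y, num_classes=None):
    """Return one weight per sample, inversely proportional to its class count."""
    y = np.asarray(y)
    counts = np.bincount(y, minlength=num_classes or 0)
    class_w = np.zeros(len(counts), dtype=np.float64)
    present = counts > 0
    class_w[present] = 1.0 / counts[present]   # absent classes keep weight 0
    return class_w[y]                          # index by label: one weight per sample

def make_sampler(y):
    # torch is imported lazily so the weight computation can be checked without it.
    import torch
    from torch.utils.data import WeightedRandomSampler
    w = torch.from_numpy(per_sample_weights(y))  # float64, i.e. a DoubleTensor
    return WeightedRandomSampler(w, num_samples=len(w), replacement=True)

y = np.array([0, 0, 0, 1, 3])   # hypothetical labels; class 2 is absent
w = per_sample_weights(y)       # finite weights, same length as y
```

With these weights each class that is present has the same total weight, so draws are roughly balanced across classes. The sampler would then be passed to a DataLoader through its `sampler` argument, for example `DataLoader(dataset, batch_size=32, sampler=make_sampler(dataset.y))`, where the batch size is an arbitrary choice.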



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
