'Implementing WeightedRandomSampler on imbalanced data set: RuntimeError: invalid multinomial distribution
I am trying to implement a weighted sampler for a very imbalanced data set. There are 182 different classes. Here is an array of the bin counts per class:
array([69487, 5770, 5753, 138, 4308, 10, 1161, 29, 5611,
350, 7, 183, 218, 4, 3, 3872, 5, 950,
33, 3, 443, 16, 20, 330, 4353, 186, 19,
122, 546, 6, 44, 6, 3561, 2186, 3, 48,
8440, 338, 9, 610, 74, 236, 160, 449, 72,
6, 37, 1729, 2255, 1392, 12, 1, 3426, 513,
44, 3, 28, 12, 9, 27, 5, 75, 15,
3, 21, 549, 7, 25, 871, 240, 128, 28,
253, 62, 55, 12, 8, 57, 16, 99, 6,
5, 150, 7, 110, 8, 2, 1296, 70, 1927,
470, 1, 1, 511, 2, 620, 946, 36, 19,
21, 39, 6, 101, 15, 7, 1, 90, 29,
40, 14, 1, 4, 330, 1099, 1248, 1146, 7414,
934, 156, 80, 755, 3, 6, 6, 9, 21,
70, 219, 3, 3, 15, 15, 12, 69, 21,
15, 3, 101, 9, 9, 11, 6, 32, 6,
32, 4422, 16282, 12408, 2959, 3352, 146, 1329, 1300,
3795, 90, 1109, 120, 48, 23, 9, 1, 6,
2, 1, 11, 5, 27, 3, 7, 1, 3,
70, 1598, 254, 90, 20, 120, 380, 230, 180,
10, 10])
In some classes, instances are as low as 1. I am trying to implement a Weighted random sampler from torch for this dataset. However, as the class imbalance is so large, when I calculate weights using
count_occr = np.bincount(dataset.y)
lbl_weights = 1. / count_occr
weights = np.array(lbl_weights)
weights = torch.from_numpy(weights)
sampler = WeightedRandomSampler(weights.type('torch.DoubleTensor'), len(weights*2))
I get two error messages:
RuntimeWarning: divide by zero encountered in true_divide
and
RuntimeError: invalid multinomial distribution (encountering probability entry = infinity or NaN)
Does anyone have a work around for this ? I was considering multiplying the lbl_weights by some scalar however I am not sure if this is a viable option.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
