Undersampling an imbalanced multi-class DataFrame using pandas

I have a DataFrame like this:

file_name                                                label

../input/image-classification-screening/train/...         1
../input/image-classification-screening/train/...         7
../input/image-classification-screening/train/...         9
../input/image-classification-screening/train/...         9
../input/image-classification-screening/train/...         6

And it has 11 classes (0 to 10) and has high class imbalance. Below is the output of train['label'].value_counts():

6     6285
3     4139
9     3933
7     3664
2     2778
5     2433
8     2338
0     2166
4     2052
10    1039
1      922

How do I undersample this data in pandas so that each class has no more than 2500 examples? I want to randomly remove data points from the majority classes (6, 3, 9, 7 and 2).



Solution 1:[1]

You can use sample in a groupby.apply. Here's a reproducible example with 4 unbalanced labels.

np.random.seed(1)
df = pd.DataFrame({
    'a':range(100), 
    'label':np.random.choice(range(4), size=100, p=[0.5,0.3,0.18,0.02])})
print(df['label'].value_counts())
# 0    51
# 1    30
# 2    18
# 3     1
# Name: label, dtype: int64

Now, to keep at most 25 rows per label (replace with 2500 in your case), you do:

nMax = 25 #change to 2500

res = df.groupby('label').apply(lambda x: x.sample(n=min(nMax, len(x))))

print(res['label'].value_counts())
# 0    25 # see how label 0 and 1 are now 25
# 1    25
# 2    18 # and the smaller groups stay the same
# 3     1
# Name: label, dtype: int64
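Applied to the asker's DataFrame, the same pattern looks like the sketch below. The file names and group sizes here are placeholder stand-ins, not the asker's actual data; passing group_keys=False keeps a flat index instead of the (label, row) MultiIndex that a plain groupby.apply produces.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the asker's DataFrame (file names are placeholders)
rng = np.random.default_rng(0)
train = pd.DataFrame({
    'file_name': [f'img_{i}.jpg' for i in range(200)],
    'label': rng.choice(11, size=200),
})

n_max = 12  # use 2500 on the real data

# Cap every class at n_max rows; classes already below the cap are untouched.
# group_keys=False avoids adding 'label' as an extra index level.
under = (train.groupby('label', group_keys=False)
              .apply(lambda g: g.sample(n=min(n_max, len(g)), random_state=0)))

print(under['label'].value_counts())
```

Note that pandas (1.1+) also offers DataFrameGroupBy.sample directly, but it raises when n exceeds a group's size, so the min(n_max, len(g)) clamp inside apply is still the simplest way to leave minority classes intact.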

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ben.T