Undersampling a multi-label DataFrame using pandas
I have a DataFrame like this:
file_name label
../input/image-classification-screening/train/... 1
../input/image-classification-screening/train/... 7
../input/image-classification-screening/train/... 9
../input/image-classification-screening/train/... 9
../input/image-classification-screening/train/... 6
And it has 11 classes (0 to 10) and has high class imbalance. Below is the output of train['label'].value_counts():
6 6285
3 4139
9 3933
7 3664
2 2778
5 2433
8 2338
0 2166
4 2052
10 1039
1 922
How do I under-sample this data in pandas so that each class has at most 2500 examples? I want to randomly remove data points from the majority classes like 6, 3, 9, 7 and 2.
Solution 1:[1]
You can use sample in a groupby.apply. Here's a reproducible example with 4 unbalanced labels.
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({
'a':range(100),
'label':np.random.choice(range(4), size=100, p=[0.5,0.3,0.18,0.02])})
print(df['label'].value_counts())
# 0 51
# 1 30
# 2 18
# 3 1
# Name: label, dtype: int64
Now, to select at most 25 rows (replace with 2500 in your case) per label, you do:
nMax = 25 #change to 2500
res = df.groupby('label').apply(lambda x: x.sample(n=min(nMax, len(x))))
print(res['label'].value_counts())
# 0 25 # see how label 0 and 1 are now 25
# 1 25
# 2 18 # and the smaller groups stay the same
# 3 1
# Name: label, dtype: int64
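One detail to be aware of: groupby.apply with sample returns a result indexed by a MultiIndex of (label, original row index). A minimal sketch of the same capping step that flattens the index back to a plain RangeIndex, passing group_keys=False (available in modern pandas) and a fixed random_state for repeatability (both are assumptions on top of the original answer):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({
    'a': range(100),
    'label': np.random.choice(range(4), size=100, p=[0.5, 0.3, 0.18, 0.02]),
})

nMax = 25  # cap per label; change to 2500 for the original question

# group_keys=False keeps the group label out of the index,
# min(nMax, len(x)) leaves groups smaller than the cap untouched
res = (
    df.groupby('label', group_keys=False)
      .apply(lambda x: x.sample(n=min(nMax, len(x)), random_state=0))
      .reset_index(drop=True)
)
```

Each label's count in res is the smaller of nMax and its original count, so minority classes keep all their rows while majority classes are randomly trimmed.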
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ben.T |
