For a binary classification, how can I sample data by column so that 2/3 of the data contains zeros and 1/3 contains ones?
I have a large dataset containing four columns, and the third column contains the (binary) label (a value of either 0 or 1). This dataset is imbalanced: it contains many more zeros than ones. The data looks like:
3 5 0 0.4
4 5 1 0.1
5 13 0 0.5
6 10 0 0.8
7 25 1 0.3
: : : :
I know that I can obtain a balanced subset containing 50% zeros and 50% ones with, for example:
df_sampled = df.groupby(df.iloc[:,2]).sample(n=20000, random_state=1)
But how can I amend the one-liner given above to change the ratio of zeros to ones? For example, how can I sample this data (by the third column) so that 2/3 of the sampled rows are zeros and 1/3 are ones?
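One possible approach, sketched here under assumptions, is to sample each label group separately with its own `n` and concatenate the results. The stand-in DataFrame, its column names, and the `n_per_label` mapping below are hypothetical and only illustrate the idea; the counts 200 and 100 are arbitrary values chosen to give a 2:1 ratio, and each group must contain at least as many rows as requested.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: four columns, label in the third.
rng = np.random.default_rng(0)
labels = np.array([0] * 800 + [1] * 200)  # imbalanced: many more zeros than ones
rng.shuffle(labels)
df = pd.DataFrame({
    "a": rng.integers(0, 100, size=1000),
    "b": rng.integers(0, 100, size=1000),
    "label": labels,
    "d": rng.random(1000).round(1),
})

# Assumed target counts per label value: 2/3 zeros, 1/3 ones.
n_per_label = {0: 200, 1: 100}

# Group by the third column, draw a different number of rows from each group,
# then stitch the per-group samples back together.
df_sampled = pd.concat([
    group.sample(n=n_per_label[value], random_state=1)
    for value, group in df.groupby(df.iloc[:, 2])
])

print(df_sampled.iloc[:, 2].value_counts())
```

The rows come back ordered by group; if that matters, something like `df_sampled.sample(frac=1, random_state=1)` should shuffle them.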
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
