Balancing a multilabel dataset using Julia
I have a dataframe like this:
| id | text | feat_1 | feat_2 | feat_3 | feat_n |
|---|---|---|---|---|---|
| 1 | random coments | 0 | 0 | 1 | 0 |
| 2 | random coments2 | 1 | 0 | 1 | 0 |
| 1 | random coments3 | 1 | 1 | 1 | 1 |
The feat columns run from 1 to 100 and are the labels of a multilabel dataset. Their values are 1 and 0 (boolean).
The dataset has over 50k records and the labels are unbalanced. I am looking for a way to balance it, and I have been working on this approach:
Sum the values in each feat column and then use the lowest value of this sum as a threshold to filter the dataset.
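A minimal sketch of that counting step, assuming the data lives in a DataFrames.jl `DataFrame` and that every label column name starts with `feat_`. The toy values and the names `label_cols`, `counts`, and `threshold` are illustrative only, not part of the original data:

```julia
using DataFrames

# Toy data mirroring the layout in the question (values are made up).
df = DataFrame(
    id = 1:4,
    text = ["a", "b", "c", "d"],
    feat_1 = [0, 1, 1, 0],
    feat_2 = [0, 0, 1, 0],
    feat_3 = [1, 1, 1, 1],
)

# Select every label column by its name prefix (assumed to be "feat_").
label_cols = names(df, r"^feat_")

# Number of positive rows per label.
counts = Dict(c => sum(df[!, c]) for c in label_cols)

# The count of the rarest label is the proposed threshold.
threshold = minimum(values(counts))
```

Looking at `counts` before filtering also shows how skewed the labels are, which helps judge whether the rarest label leaves enough rows to train on.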
I need to keep all the feature columns, so I can only exclude rows (comments) to achieve this.
The main idea boils down to this: I need a balanced dataset to use in a multilabel classification problem. That is, I need roughly the same number of positive examples in each feat column, since those columns are my labels.
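One possible way to drop rows while keeping every label column is a greedy downsampling pass. The sketch below assumes DataFrames.jl; `cap_labels` is a hypothetical helper written for this example, not a library function, and the toy data is made up. It only guarantees that no label exceeds the cap. Exact balance is generally not reachable, because a single row can carry several labels at once.

```julia
using DataFrames, Random

# Hypothetical helper: keep a row only if none of its active labels
# has already reached `cap` positives among the rows kept so far.
function cap_labels(df::AbstractDataFrame, label_cols, cap::Integer;
                    rng = Random.default_rng())
    kept = Int[]
    seen = Dict(c => 0 for c in label_cols)
    for i in shuffle(rng, 1:nrow(df))          # visit rows in random order
        active = [c for c in label_cols if df[i, c] == 1]
        isempty(active) && continue            # skip rows with no label at all
        if all(seen[c] < cap for c in active)
            push!(kept, i)
            for c in active
                seen[c] += 1
            end
        end
    end
    return df[sort(kept), :]                   # all columns are preserved
end

# Toy data (made up) with the same layout as in the question.
df = DataFrame(
    id = 1:6,
    text = ["a", "b", "c", "d", "e", "f"],
    feat_1 = [1, 1, 1, 0, 0, 1],
    feat_2 = [0, 1, 0, 1, 0, 0],
    feat_3 = [1, 1, 1, 1, 1, 1],
)

label_cols = names(df, r"^feat_")
cap = minimum(sum(df[!, c]) for c in label_cols)   # count of the rarest label
balanced = cap_labels(df, label_cols, cap; rng = MersenneTwister(42))
```

A label that appears on almost every row (like `feat_3` in the toy data) limits how many rows can be kept, so rare labels may end up below the cap. Visiting rows that carry rare labels first, or oversampling the rare labels instead, are alternatives worth comparing.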
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
