Balancing a multilabel dataset using Julia

I have a dataframe like this:

id   text               feat_1  feat_2   feat_3   feat_n
1    random coments        0      0        1       0
2    random coments2       1      0        1       0
1    random coments3       1      1        1       1

The feat columns run from feat_1 to feat_100 and are the labels of a multilabel dataset. The values are 1 and 0 (boolean).

The dataset has over 50k records and the labels are unbalanced. I am looking for a way to balance it, and I have been working on this approach:

Sum the values in each feat column and then use the lowest value of this sum as a threshold to filter the dataset.
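
To make the step above concrete, here is a minimal sketch in base Julia. It assumes the label columns have been pulled out into a boolean matrix called `labels` (a hypothetical name), with one row per record and one column per feat column; the toy values mirror the three sample rows.

```julia
# Toy label matrix (hypothetical data): rows are records, columns are feat_1 ... feat_4.
# With DataFrames.jl, something like `Matrix(df[:, r"^feat_"]) .== 1` should give a
# comparable matrix, assuming every label column shares the `feat_` prefix.
labels = Bool[
    0 0 1 0
    1 0 1 0
    1 1 1 1
]

# Number of positive examples per label (sum down each column).
counts = vec(sum(labels, dims=1))   # [2, 1, 3, 1]

# The rarest label sets the threshold for the other labels.
threshold = minimum(counts)         # 1
```

The threshold on its own does not say which rows to drop; it only fixes the target count per label.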

I need to keep all the feature columns, so I can only exclude comments (rows) to achieve this.

The main idea boils down to this: I need a balanced dataset to use in a multilabel classification problem. That is, I need the same amount of data for each feat column, since they are my labels.
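
One possible way to approach this is a greedy undersampling pass, sketched below in base Julia. This is only an illustration under stated assumptions, not a definitive method: `greedy_undersample` and `labels` are hypothetical names, rows with no active label are skipped, and the per-label cap defaults to the count of the rarest label. Because a single row can carry several labels, the pass only guarantees that no label exceeds the cap; it may not reach exactly equal counts for every label.

```julia
using Random

# Greedy undersampling sketch: visit rows in random order and keep a row only if
# none of its active labels has already reached `cap` kept positives.
# Returns the kept row indices and the resulting per-label counts.
function greedy_undersample(labels::AbstractMatrix{Bool};
                            cap::Int = minimum(sum(labels, dims=1)),
                            rng::AbstractRNG = Random.default_rng())
    n_rows, n_labels = size(labels)
    kept_counts = zeros(Int, n_labels)
    kept_rows = Int[]
    for i in randperm(rng, n_rows)
        active = findall(view(labels, i, :))   # column indices of labels set on row i
        isempty(active) && continue            # assumption: unlabeled rows are dropped
        if all(kept_counts[j] < cap for j in active)
            kept_counts[active] .+= 1
            push!(kept_rows, i)
        end
    end
    return sort(kept_rows), kept_counts
end

# Toy label matrix (hypothetical data) matching the three sample rows.
labels = Bool[
    0 0 1 0
    1 0 1 0
    1 1 1 1
]

kept_rows, kept_counts = greedy_undersample(labels; rng = MersenneTwister(42))
```

The kept indices can then be used to subset the original table, which keeps every feat column and only removes rows. Which rows survive depends on the random order, so fixing the seed makes the result reproducible.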



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
