How to handle skewed categorical data for a multiclass classification task?

I want to know how to handle skewed data in a particular column that holds many categorical values, where some values have much higher value_counts() than others. (In the plot, x is the unique value and y is its count.) As you can see in this data, the values greater than 7 have far lower counts than the others. How should I handle this kind of skewed data? (This is not the target variable. I want to know about a skewed independent variable.)

I tried replacing these low-count values with a single value (-1). That way the count of -1 became comparable to the counts of the other values. But training a classification model on this modified data will affect the accuracy. (Figure: smaller-count values converted to -1.)
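A minimal sketch of this kind of replacement, for illustration only. The helper name `lump_rare`, the threshold of 10, and the example column are all assumptions; the real column and cutoff are not shown in the question. It uses `collections.Counter` from the standard library to get the same per-value counts that value_counts() reports in pandas.

```python
from collections import Counter

def lump_rare(values, min_count, other=-1):
    """Replace every category seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Hypothetical column: categories 1-3 are common, 8 and 9 are rare.
column = [1] * 50 + [2] * 40 + [3] * 30 + [8] * 3 + [9] * 2

# Assumed threshold: anything seen fewer than 10 times goes into the -1 bucket.
lumped = lump_rare(column, min_count=10)
print(Counter(lumped))
```

The number of rows is unchanged; only the rare labels are merged into one bucket, so the model can no longer tell those rare values apart.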



Solution 1:[1]

Oversampling techniques for minority classes/categories may not work well in many scenarios. You could read more about them here.

One thing you could do is assign different weights to samples from different classes in your model's loss function, with each weight inversely proportional to the class frequency. This ensures that classes with few data points affect the model's loss as much as classes with many data points.
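The inverse-frequency weighting described above can be sketched as follows. This is an illustration under stated assumptions, not a full training setup: the helper `inverse_frequency_weights` and the 900/90/10 label split are hypothetical. The formula used, n_samples / (n_classes * class_count), is the same one scikit-learn applies for `class_weight="balanced"`.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical imbalanced labels: 900 / 90 / 10 samples per class.
labels = [0] * 900 + [1] * 90 + [2] * 10
weights = inverse_frequency_weights(labels)

# After reweighting, every class contributes the same total weight to the loss.
per_class_total = {cls: weights[cls] * cnt for cls, cnt in Counter(labels).items()}
print(weights)
print(per_class_total)
```

Many libraries accept such a mapping directly (for example, the `class_weight` parameter of several scikit-learn classifiers), so the weights usually do not need to be applied by hand.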

You could share more details about the dataset or the specific model that you are using, to get more specific suggestions/solutions.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Aravind G.