Imbalanced multiclass classification using company names

I have the classification scenario below, in which I'm getting very low F1, precision, recall, and other metrics.

  1. The target is multiclass (about 200 classes) and highly imbalanced.
  2. I use only the company name as a feature (mostly 1-2 words, 8 words at most); no other fields (description, etc.) are used.
  3. Training data: roughly 100k+ records.
  4. Preprocessing: removal of numeric characters, special characters, and stopwords.
  5. I have very limited computing resources (that's why I always get a memory error when I try oversampling techniques like SMOTE, distance_smote for multiclass, etc.).
  6. I tried different vectorizers/embeddings/tokenizers (word2vec, TF-IDF, fastText, BERT, RoBERTa, etc.), but to no avail.
  7. I tried (and fine-tuned) different algorithms (neural networks, SVM, trees, boosting, etc.), but the scores are still low.
  8. I also tried cost-sensitive learning (using class weights), but it only decreased my scores.

I have tried every option I know of, but the scores are not increasing. Can you recommend other options, or do you think any part of the process is wrong or should be discarded? Thank you!

Distribution of target labels: [image]

Sample observations: [image]



Solution 1:[1]

There is essentially no way to know that 'Exxon' is an oil company, and 'Apple' a computer company, and 'McDonalds' a fast-food chain, just from their company names.

Even if you have a list of every other company in the world, by name and type, that's not enough to make the deduction for those three. Only outside info, like a few sentences about each company or other data, could classify them.

In fact, while company names sometimes describe their exact field-of-commerce, often they're totally arbitrary, as that gives them more freedom to range over many products/services, or create their own unique associations with the name (aka branding).

So I strongly suspect your (unshown) names & (unshown) labels are just too arbitrary for the data you're using to get very good at the task you're attempting.

Is there a real-world situation where someone will only have a company name – no other info, or research options – and benefit from correctly guessing the class? If so, more specifics about the situation might help generate more specific tactical recommendations. But mainly such recommendations will be: get richer data about the targets of the classification.

You might squeeze a little more out of vague trends in corporate naming via better preprocessing/feature-extraction. You may want to keep numbers, special characters, & punctuation in some form, as they might include extra slight hints. Using subwords (character n-grams) might also reveal some shared word-roots used even in made-up names.
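
To make the subword idea concrete, here is a minimal sketch in plain Python. It only illustrates the feature extraction and is not a tuned pipeline. The helper names `char_ngrams` and `shared_ngrams`, the `<`/`>` boundary markers, the n-gram ranges, and the example company names are all assumptions made up for illustration. Digits and punctuation are deliberately left in.

```python
from collections import Counter


def char_ngrams(name, n_min=2, n_max=4):
    """Count character n-grams in one company name.

    Digits and punctuation are kept on purpose; only case and
    whitespace are normalized.
    """
    counts = Counter()
    for token in name.lower().split():
        padded = f"<{token}>"  # mark word start/end so prefixes and suffixes are distinct
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


def shared_ngrams(a, b, n_min=3, n_max=4):
    """Return the sorted n-grams that two names have in common."""
    grams_a = set(char_ngrams(a, n_min, n_max))
    grams_b = set(char_ngrams(b, n_min, n_max))
    return sorted(grams_a & grams_b)


# Hypothetical names, invented for illustration only.
print(shared_ngrams("Petrotech", "Petrolux"))  # shared root fragments such as "petr"
print(char_ngrams("A&B Holdings 24/7"))        # "&", "/" and digits survive as features
```

If scikit-learn is available, `TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))` produces comparable sparse features, and `HashingVectorizer` with the same `analyzer` setting keeps memory bounded, which may matter given the limited resources described in the question. Whether these features help at all still depends on how much signal the names actually carry.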

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 gojomo