'Imbalanced multiclass classification using company names
I have this classification scenario below in which Im getting a very low F1, precision, recall and other metrics.
- Target is multiclass (about ~200 classes) which is highly imbalanced
- I only use company names as classifier (mostly 1-2 words which have max of 8 words), no other fields (like description, etc.)
- Training data ~ 100k+ records
- Preprocessing: numeric and special characters and stopwords removal
- I have very low resources for processing (thats why when I try to use oversampling techniques like smote, distance_smote for multiclass, etc., I always get memory error)
- Tried using different vectorization/embedding/tokenizer like word2vec, tfidf, fasttext, bert, roberta, etc. but to no avail
- Tried using (and fine-tuning) different algorithms (networks, svm, trees, boosting, etc.) but also getting low scores.
- I also did cost-sensitive learning (using class weights) but it only decreased my scores.
Tried all options that I know but scores are not increasing. Can you recommend other options here or do you think any part of the process that may be wrong/discarded? Thank you!
Solution 1:[1]
There is essentially no way to know that 'Exxon' is an oil company, and 'Apple' a computer company, and 'McDonalds' a fast-food chain, just from their company names.
Even if you have a list of every other company in the world, by name and type, that's not enough to make the deduction for these last 3. Only other outside info – like a few sentences about them, or other data – could classify them.
In fact, while company names sometimes describe their exact field-of-commerce, often they're totally arbitrary, as that gives them more freedom to range over many products/services, or create their own unique associations with the name (aka branding).
So I strongly suspect your (unshown) names & (unshown) labels are just too arbitrary for the data you're using to get very good at the task you're attempting.
Is there a real-world situation where someone will only have a company name – no other info, or research options – and benefit from correctly guessing the class? If so, more specifics about the situation might help generate more specific tactical recommendations. But mainly such recommendations will be: get richer data about the targets of the classification.
You might squeeze a little more out of vague trends in corporate naming via better preprocessing/feature-extraction. You may want to keep numbers, special-characters, & punctuation in some form, as they might include extra slight hints. Using subwords (character n-grams) might also reveal some shared word-roots used even in made-up names.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | gojomo |


