Generating multiple labels for documents

I am currently working on a task where we scrape pages from the web and try to generate labels for each webpage.

For that, we have extracted the text from those websites and performed preprocessing (removal of stop words, punctuation, non-ASCII characters, etc.), then used TF-IDF to find the weight of each word within a document. We then select the words whose TF-IDF value is above a specific threshold, and finally assign labels to each document by comparing the word2vec vectors of those extracted keywords against our custom pre-defined labels (such as web, nature, business, etc.) using cosine similarity.
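Roughly, the current approach looks like the following simplified sketch (the vector model, thresholds, and example texts here are just placeholders, not our real values):

```python
# Simplified sketch of the approach described above (not the exact code):
# TF-IDF keyword selection per document, then cosine similarity between
# those keywords and a small set of predefined labels in word-vector space.
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "solar panels and wind farms cut energy costs for rural businesses",
    "the new browser release improves web page rendering speed",
]
candidate_labels = ["web", "nature", "business"]    # our predefined labels
TFIDF_THRESHOLD = 0.2                               # tuned by hand
SIMILARITY_THRESHOLD = 0.4                          # tuned by hand

word_vectors = api.load("glove-wiki-gigaword-100")  # any pretrained vectors
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = np.array(vectorizer.get_feature_names_out())

for i, doc in enumerate(docs):
    weights = tfidf[i].toarray().ravel()
    keywords = terms[weights > TFIDF_THRESHOLD]     # high-TF-IDF words only
    labels = []
    for label in candidate_labels:
        sims = [word_vectors.similarity(k, label)
                for k in keywords if k in word_vectors]
        if sims and np.mean(sims) > SIMILARITY_THRESHOLD:
            labels.append(label)
    print(doc[:40], "->", labels)
```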

Does this sound like it would produce good results? Any tips to simplify or improve the process of generating multiple labels for the docs?



Solution 1:[1]

If you're already running this, you know something better than "does this sound like it would produce good results?" You know the actual results.

Does an informal look over the output seem to deliver the results you need?

More generally, you have created an ad-hoc "text classifier" using a bunch of folk techniques mish-mashed together in series.

But, if you work through some online intro-tutorials to text classification with Python, you'll see that the techniques you've chosen are just a few of many.

You can independently choose how you preprocess/tokenize the text, or enrich it with other features (like TF-IDF weightings or word-vectors).

Then, you can try many different classification-algorithms – which can be far more sophisticated than "pick the highest-weighted TF-IDF words per doc & cosine-similarity compare them to some prepicked label categories".

That ad hoc approach may work OK, as a way to explore, or as a baseline. But other techniques can, when given a number of examples (as text or text-preprocessed-into-other-features) and a number of 'known labels' (which you initially create by hand), make far more sophisticated distinctions between texts, and so would likely perform better – if and when you have sufficient training data.
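For instance, once you have even a modest hand-labelled set, a plain supervised multi-label baseline takes only a few lines. This is an illustrative sketch with toy data and arbitrary model choices, not a prescription:

```python
# Illustrative multi-label baseline: TF-IDF features plus a one-vs-rest
# linear classifier, trained on hand-labelled examples. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

train_texts = [
    "stock markets rallied after strong earnings",
    "hiking trails and wildlife parks reopen",
    "online retailer launches a new web storefront",
]
train_labels = [["business"], ["nature"], ["business", "web"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(train_labels)            # label sets -> binary matrix

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

new_doc = ["quarterly profits beat forecasts"]
pred = clf.predict(vectorizer.transform(new_doc))
print(mlb.inverse_transform(pred))             # e.g. [('business',)]
```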

A reasonable place to get started would be working through the scikit-learn project's "Working With Text Data" tutorial:

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In particular, you may want to start structuring your task as a 'pipeline' of steps that you can then add, remove, or tune independently, as shown there.
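To sketch what that looks like (step names and models here are just examples in the tutorial's style, not anything from your question):

```python
# A minimal Pipeline in the style of the scikit-learn tutorial: the
# vectorizer and classifier become named steps you can swap or tune.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # swap tokenization/preprocessing here
    ("clf", LinearSVC()),           # or try another classifier here
])

train_texts = ["markets and quarterly earnings", "forests, rivers and wildlife"]
train_labels = ["business", "nature"]
text_clf.fit(train_texts, train_labels)
print(text_clf.predict(["stock prices and corporate profits"]))
```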

And, add a rigorous, repeatable, quantitative (automated) evaluation at the end – so that you get some set of summary scores for every alternative step/parameterization you choose, to know which are improving things (worth the complexity) & which may not be important. (For example, prematurely removing stop words & punctuation might make some classifiers work worse! But the only way to know is to try it both ways.)
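As a minimal illustration of that kind of check (toy data, arbitrary models, nothing from your actual task), you can score the same pipeline with and without a preprocessing choice under one cross-validated metric:

```python
# Sketch of an automated comparison: same data, same metric, one
# preprocessing choice varied (stop-word removal on vs. off).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

texts = [
    "quarterly profits and stock markets",
    "startup funding and venture capital",
    "corporate mergers and earnings reports",
    "hiking trails through the forest",
    "rivers, wildlife and national parks",
    "mountain weather and alpine plants",
]
labels = ["business"] * 3 + ["nature"] * 3

for stop_words in (None, "english"):
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1_macro")
    print(stop_words, round(scores.mean(), 3))
```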


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source

Solution 1: gojomo