Will NER improve Text Categorization?

I was wondering: if I'm doing text categorization (with spaCy, using their textcat_multilabel component, for example), will the results improve if an NER component runs before it in the pipeline? My thinking is this: suppose a sentence like "Senior Javascript Developer" is categorized as, say, "A" (or any other category), and "Javascript" is tagged as a "Programming Language" entity or similar. Would the textcat pick that up and use it to decide that a sentence like "Python Engineer" is similar because of that entity, and so also categorize it as "A"? Assuming "Python" is also tagged as a "Programming Language" entity, of course.

My understanding is that the textcat component takes the tok2vec vectors and looks for similarity there. But will those vectors be similar in one or more dimensions when the entities found by NER are similar? Am I thinking about this the right way? If it's possible at all, how would that work with spaCy?
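To make the setup in the question concrete, here is a minimal sketch of a pipeline with ner placed before textcat_multilabel. The component and factory names are spaCy v3 defaults; note that ordering them this way does not, by itself, feed entity labels into the textcat component:

```python
import spacy

# Hypothetical setup: ner runs before textcat_multilabel in the
# pipeline. By default each component builds on its own tok2vec,
# so textcat never sees the entity labels that ner writes to the Doc.
nlp = spacy.blank("en")
nlp.add_pipe("ner")
nlp.add_pipe("textcat_multilabel")

print(nlp.pipe_names)  # ['ner', 'textcat_multilabel']
```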



Solution 1:[1]

Simply adding an NER component to the pipeline will not improve things, no: the textcat component doesn't look at the entity annotations that ner writes to the Doc.

If you add an NER component and train it jointly with the textcat component, the two can share representations, which could help in theory. In practice it seems unlikely to help, though.
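The joint training mentioned above is usually done in spaCy v3 with a shared tok2vec component that the other components listen to. A rough sketch of the relevant config sections follows; the architecture names are spaCy's documented defaults, and the surrounding config (training, corpora, model details) is omitted:

```ini
# Shared embedding layer; both ner and textcat_multilabel listen to it,
# so updates from either component's loss shape the same representations.
[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.textcat_multilabel]
factory = "textcat_multilabel"

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
```

With this layout, the NER objective can only influence categorization indirectly, by shaping the shared tok2vec weights; there is still no explicit "Programming Language" feature handed to the textcat.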

This has been asked in the spaCy forums before, and here I responded to the idea in some detail. Basically, though, what limited research I could find on using NER features for text classification suggests it doesn't help much.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 polm23