How to Measure Similarity or Difference of Meaning Between Words? [closed]
Say you have two random words ('yellow' and 'ambient', or 'goose' and 'kettle'). What technique could be used to rate how similar or different they are in meaning, as informed by popular usage? For example, on a scale from 0 to 1 where antonyms score 0 and synonyms score 1, 'yellow' and 'ambient' might score 0.65.
Note: I'm not talking about how close the two strings are to each other, but rather an approximation of how similar their meanings are.
Solution 1:[1]
One approach that works quite well for measuring semantic similarity is to look at the contexts in which the two words occur. Since the distribution of words is not random, the context carries a lot of information: so much, in fact, that you can often guess the meaning of a word you don't know (e.g. in a foreign language) as long as you understand the words around it.
In my PhD thesis I investigated this approach in various ways: I took instances of a word from a corpus and recorded their contexts, i.e. the n words to their left and right. You then do the same for another word and compare how similar the two sets of contexts are. Depending on the metric you use, this gives you a value between 0 and 1.
You can treat the contexts as frequency lists and compare the frequencies of the same word in both contexts, or you can be more specific and also take each context word's distance from the target into account. The more specific you are, the more data you need, but the more accurate your results will be.
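The frequency-list variant can be sketched in a few lines. This is only an illustration, not the author's actual thesis code: the toy corpus, the window size, and the choice of cosine similarity as the 0-to-1 metric are all assumptions.

```python
# Sketch of the context-comparison approach: collect the words within a
# small window around each occurrence of a target word, then compare the
# resulting frequency lists with cosine similarity (0..1).
from collections import Counter
import math

def context_counts(tokens, target, window=2):
    """Frequency counts of words appearing within `window` positions
    to the left/right of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

def cosine(c1, c2):
    """Cosine similarity between two frequency vectors."""
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Tiny invented corpus; a real experiment would use millions of tokens.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat ate the fish . the dog ate the bone .").split()

print(cosine(context_counts(corpus, "cat"),
             context_counts(corpus, "dog")))
```

Because 'cat' and 'dog' occur in near-identical contexts here, the similarity comes out high (about 0.97); with a realistic corpus the scores spread out much more.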
One caveat is words with different meanings (homographs): the word 'left' can be an adjective (the left door), a verb (they left the room), or a noun (they are part of the left). Each of these will have different contexts, but you won't be able to distinguish them automatically during processing, so the similarity values for words with multiple meanings will be somewhat 'smudged'. And some words will have near-identical contexts, e.g. names: in "I went to X on holiday", X can be any country/city/location.
It also will probably not work very well with antonyms, as they often occur in the same contexts: this door is open/closed/locked/unlocked, this book is too easy/difficult, etc. But it might; it's hard to tell without actually trying it out. One thing that does work well is closed categories, such as days of the week or months.
While this can be done in a purely symbolic way, I think this is also the same principle used by embeddings in deep learning algorithms, where words are represented by context vectors.
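With learned embeddings, the comparison step looks the same: take two dense vectors and compute their cosine. The vectors below are made-up three-dimensional toys, not output from any trained model (real embeddings come from e.g. word2vec or GloVe and have hundreds of dimensions), so the numbers are purely illustrative.

```python
# Cosine similarity over dense "embedding" vectors.
# NOTE: these vectors are invented for illustration only.
import math

embeddings = {
    "yellow":  [0.8, 0.1, 0.3],
    "ambient": [0.5, 0.4, 0.6],
    "goose":   [0.1, 0.9, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine(embeddings["yellow"], embeddings["ambient"]))
print(cosine(embeddings["yellow"], embeddings["goose"]))
```

With a pretrained model the pattern is identical; only the lookup table changes.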
Solution 2:[2]
I do not really understand what you mean by similarity, especially with respect to meaning. You would need a dataset that assigns meaning to words. A popular example of this is sentiment analysis: if you have a lot of textual data, say tweets from Twitter, you might want to know whether the data is mostly positive or negative. To do this, you would find a dataset of a similar nature that has already been labelled into categories, and then use it to train a classifier (e.g. a Naive Bayes classifier) that assigns new texts to those categories. In this way you can attach meaning to texts computationally.
This allows both overall evaluations and per-input evaluations of how well each text scores across the different categories of meaning.
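A minimal sketch of the labelled-data idea, using a bag-of-words Naive Bayes classifier written from scratch. The four training sentences are invented for illustration; a real application would use a labelled corpus of thousands of examples (and typically a library implementation such as scikit-learn's MultinomialNB).

```python
# Tiny Naive Bayes sentiment classifier with add-one (Laplace) smoothing.
# Training data is invented for illustration only.
from collections import Counter, defaultdict
import math

train = [
    ("i love this great film", "pos"),
    ("what a wonderful happy day", "pos"),
    ("this movie is terrible", "neg"),
    ("i hate this awful weather", "neg"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def classify(text):
    """Return the label with the highest log-probability under a
    bag-of-words model with add-one smoothing."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) /
                              (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("a wonderful film"))  # -> "pos"
print(classify("this is awful"))     # -> "neg"
```

The per-class log-probabilities in `scores` are what give you the "how well each input scored across categories" view mentioned above, rather than just the single winning label.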
I'm not sure if that's what you're looking for in an answer though.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Oliver Mason |
| Solution 2 | cinderashes |
