'How do you handle positive and negative samples when they are the same in word2vec's negative sampling?
I have a question about doing negative sampling in word2vec. In order to solve the binary classification problem, I know that words around the target word are labeled positive and words in other ranges are labeled negative. At this time, there are too many negative words. So, negative sampling samples according to the frequency of words appearing in the entire document.
In this case, if the words in positive are sampled in the same negative, how is it processed?
For example, if you look at the target word love in "I love pizza", (love, pizza) should have a positive label. But isn't it possible for (love, pizza) to have a negative label through negative sampling again in the sentence afterwards?
Solution 1:[1]
The negative-sampling isn't done based on words elsewhere in the sentence, but the word-frequencies across the entire training corpus.
That is, while the observed neighbors become positive examples, the negative examples are chosen at random from the entire distribution of all known words.
These negative examples might even, by bad luck, be the same as other nearby in-context-window intended-positive words! But this isn't fatal: across all training examples, across all passes, across all negative-samples drawn, the net effect of training remains roughly the same: the positive-examples have their predictions very-often up-weighted, the negative-examples down-weighted, and even occasionally where the negative-example is sometimes also a neighbor, that effect "washes out" across all the (non-biasing) errors, still leaving that real neighbor net-reinforced over the full training run.
So: you're right to note that the process seems a little sloppy, but it all evens out in the end.
FWIW, the word2vec implementations with which I'm most familiar do check if the negative-sample is exactly the positive word we're trying to predict, but don't do any other checking against the whole current window.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | gojomo |
