Cross-entropy loss implemented with a ground-truth probability vector vs. a ground-truth one-hot coded vector

Hi, I came across documentation in PyTorch which implements the cross-entropy loss function in two ways:

import torch
import torch.nn as nn

# Example of target with class indices
loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)  # random class indices in [0, 5)
output = loss(input, target)
output.backward()

# Example of target with class probabilities
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5).softmax(dim=1)  # each row is a probability distribution (sums to 1)
output = loss(input, target)
output.backward()

One method uses the target as a probability vector, and the other uses it as class indices (essentially a one-hot vector). To me, the implementation with class probabilities is closer to the definition of the loss function, but in most places I have seen the other method. Can someone clarify the difference between these two methods?

Thanks



Solution 1:[1]

I think there are two things worth explaining. From what I understood, you are asking about the following:

  • the difference between soft and hard labels;

  • the difference between the one-hot and dense (index) encoding of hard labels.

Just to be clear, a probability distribution describes how probability is distributed over the values of a random variable (here, the values correspond to the different classes of your classification task). A soft label represents the distribution itself, while a hard label represents the value of the random variable with the highest probability (i.e. the most probable class).
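For instance, a minimal sketch (the numbers are made up) of the relation between a soft label and the corresponding hard label:

import torch

# A soft label: the full distribution over 5 classes (values sum to 1).
soft_label = torch.tensor([0.05, 0.10, 0.70, 0.10, 0.05])

# The corresponding hard label: the index of the most probable class.
hard_label = soft_label.argmax()
print(hard_label)  # tensor(2)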

If you're applying the loss function to a target that represents a probability distribution (its values are non-negative and sum to one), then it most likely corresponds to a soft label, for instance a pseudo-label. In practice, this means you are applying a soft cross-entropy loss and explicitly supervising the whole distribution.
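To make "supervising the whole distribution" concrete, here is a rough sketch of the soft cross-entropy written out by hand (the tensors are just random placeholders); it should match what nn.CrossEntropyLoss computes for probability targets with the default mean reduction:

import torch
import torch.nn.functional as F

logits = torch.randn(3, 5, requires_grad=True)  # model outputs
target = torch.randn(3, 5).softmax(dim=1)       # soft targets: each row sums to 1

# Soft cross-entropy: -sum_c target[c] * log p[c], averaged over the batch.
log_probs = F.log_softmax(logits, dim=1)
loss = -(target * log_probs).sum(dim=1).mean()
loss.backward()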

Now, hard labels are what you can expect to have when using ground-truth annotations. You can represent such a label in one of two ways (this is called the encoding of the label), as illustrated in the sketch after this list:

  • either as a one-hot encoded vector: all 0s and a single 1 at the position of the true class,
  • or as a dense representation: the index of the true class.
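A short sketch of converting between the two encodings (the class count and label values here are arbitrary):

import torch
import torch.nn.functional as F

dense = torch.tensor([2, 0, 4])             # dense encoding: index of the true class
one_hot = F.one_hot(dense, num_classes=5)   # one-hot encoding: all 0s and a single 1 per row
# tensor([[0, 0, 1, 0, 0],
#         [1, 0, 0, 0, 0],
#         [0, 0, 0, 0, 1]])

recovered = one_hot.argmax(dim=1)           # back to dense indices: tensor([2, 0, 4])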

With hard labels, only the probability of the true class (the class given by the ground-truth annotation) is explicitly supervised: you are pushing to maximize the probability mass predicted for it, which implicitly minimizes the mass assigned to all other classes, since the total mass is finite (i.e. equal to 1).
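As a quick sanity check (with random values), the cross-entropy of a single sample under a hard label is just the negative log-probability predicted for the true class:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 5)
target = torch.tensor([3])                     # hard label: index of the true class

ce = F.cross_entropy(logits, target)
manual = -F.log_softmax(logits, dim=1)[0, 3]   # -log p(true class)
print(torch.allclose(ce, manual))              # True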

In PyTorch, nn.CrossEntropyLoss traditionally expected dense (index) labels as the target; since version 1.10 it also accepts class probabilities, which is what the second example in your question shows. TensorFlow's implementation, on the other hand, lets you provide targets as one-hot encodings. This lets you apply the function not only with one-hot encodings (as intended for classical classification tasks), but also with soft targets.
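Assuming a PyTorch version that accepts probability targets (1.10 or later), a small sketch showing that an index target and its one-hot counterpart, viewed as a degenerate probability distribution, give the same loss value:

import torch
import torch.nn as nn
import torch.nn.functional as F

loss = nn.CrossEntropyLoss()
logits = torch.randn(3, 5)
indices = torch.tensor([1, 0, 4])                     # dense hard labels
one_hot = F.one_hot(indices, num_classes=5).float()   # same labels as probability vectors

print(torch.allclose(loss(logits, indices), loss(logits, one_hot)))  # True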

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Ivan