
Thanks for the summary of NTR and SSM loss functions!

There are a few technical details that might be worth clarifying further:

1. Notation of the SSM loss.

Currently, the sampled softmax (SSM) loss is written as

L_{SSM} = -\sum_{i \in D_{pos}} log[ p(y_i | s_i) / (p(y_i | s_i) + \sum_{j \in S_i} p(j | s_i)) ]

Here, probability terms (e.g. p(y_i | s_i)) appear in both the numerator and the denominator of the softmax. However, I think these probability terms should be replaced by exp(z_i) to align with the definition of the softmax, where z_i is the logit (i.e. the unnormalized log probability), usually of the form z_i = s_i * v_{y_i}, which is essentially the output of the last layer of the neural network right before the softmax layer. Please see the equation in Section 3.1 of "Covington et al., Deep Neural Networks for YouTube Recommendations" for more details.

Note that this has important implications: the logit z_i can take any real value in (-inf, +inf), and exp(z_i) lies in (0, +inf). However, p(y_i | s_i) by definition lies in [0, 1] and conveys a very different meaning than exp(z_i). Essentially, what softmax does is exponentiate the logit of each candidate class, then divide the target class's exp(z_i) by the sum of the exponentiated logits over all classes, which gives the target class's probability.
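To make the logit-vs-probability distinction concrete, here is a minimal numpy sketch (the shapes and the target index are made up for illustration): the dot products s_i * v_k are unbounded logits, and only after exponentiation and normalization do we get probabilities in [0, 1].

import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=8)            # user/sequence representation s_i
V = rng.normal(size=(1000, 8))    # item embeddings v_k for a toy vocabulary of 1000 items

z = V @ s                         # logits z_k = s_i . v_k, any real value
p = np.exp(z - z.max())           # exponentiate (shifted by max(z) for numerical stability)
p = p / p.sum()                   # normalize: softmax probabilities, each in [0, 1], summing to 1

y = 42                            # hypothetical target item y_i
print(z[y], p[y])                 # z[y] can be negative or > 1; p[y] is always in [0, 1]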

Now that we have the softmax probability of the target class, we can derive the softmax cross-entropy loss, which is the negative log likelihood of that probability, i.e. -log(p), where p = exp(s_i*v_{y_i}) / \sum_{k \in V} exp(s_i*v_k). Summing the loss over all training examples, we get the softmax cross-entropy loss:

L = -\sum_{i \in D_{pos}} log[exp(s_i*v_{y_i}) / \sum_{k \in V} exp(s_i*v_k)]
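As a small sketch (reusing the s and V from the snippet above), the full-vocabulary softmax cross-entropy for one training example can be computed as:

import numpy as np

def full_softmax_xent(s, V, y):
    # -log[ exp(s . v_y) / sum_k exp(s . v_k) ], computed over the full vocabulary V
    z = V @ s                     # logits for every item k in the vocabulary
    z = z - z.max()               # subtract max for numerical stability (cancels out in the ratio)
    return -(z[y] - np.log(np.exp(z).sum()))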

The only difference between sampled softmax (SSM) and the full softmax is that the denominator of sampled softmax sums over fewer terms, namely the target plus a sampled set of negatives (optionally with some additional correction factors).
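A minimal sketch of that difference follows; the function name, the neg_ids argument, and the optional log_q correction term are my own illustration of the usual logQ correction, not any paper's exact formulation.

import numpy as np

def sampled_softmax_xent(s, V, y, neg_ids, log_q=None):
    # Denominator covers only the target y plus the sampled negatives S_i,
    # instead of the full vocabulary.
    ids = np.concatenate(([y], neg_ids))
    z = V[ids] @ s                # logits for the target and the sampled negatives only
    if log_q is not None:         # optional correction: subtract log sampling probabilities
        z = z - log_q
    z = z - z.max()
    return -(z[0] - np.log(np.exp(z).sum()))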

2. Relationship between SSM and NTR loss.

They're not mutually exclusive, and they are often used together. E.g. we can use only the SSM loss, or SSM + NTR loss.

(1) The p terms we see in the NTR loss are exactly the softmax probabilities we discussed above.

(2) Since the softmax loss is just -log(p), where p is the softmax probability, the first half of the NTR loss equation (the sum over D_pos) is exactly the SSM loss. In retrieval models, this means that for positive labels we use the SSM loss (i.e. -log(p)), and for negative labels we use the second half of the NTR loss (i.e. -log(1-p)), as sketched below.
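Here is a minimal sketch of that combination, assuming the NTR loss takes the -log(p) / -log(1-p) form described above; the function name and inputs are illustrative, with p_pos and p_neg standing for the softmax probabilities of positively- and negatively-labeled items.

import numpy as np

def ssm_plus_ntr_loss(p_pos, p_neg):
    # p_pos: softmax probabilities of positive target items  -> -log(p) term (the SSM part)
    # p_neg: softmax probabilities of negative-feedback items -> -log(1 - p) term (the NTR part)
    eps = 1e-12                             # avoid log(0)
    loss_pos = -np.log(p_pos + eps)
    loss_neg = -np.log(1.0 - p_neg + eps)
    return loss_pos.sum() + loss_neg.sum()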

3. The context of retrieval vs. ranking when discussing loss functions.

The paper "Learning from Negative User Feedback and Measuring Responsiveness for Sequential Recommenders" is discussing the NTR loss in the context of retrieval models, where the objective is to find 1 or top-K items from a large corpus (vocabulary of millions or above). Retrieval models typically follows a multi-class classification setup, which is where softmax cross-entropy loss naturally fits in, and NTR is also developed in this multi-class classification setup.

I haven't read "On the Effectiveness of Sampled Softmax Loss for Item Recommendation" in full detail, but it seems to be talking about ranking models (correct me if I'm wrong)? I think the difference between retrieval and ranking objectives should also be clarified when evaluating loss functions.
