A comparative study of two loss functions
1. Introduction
Recommendation systems are a cornerstone of modern digital platforms, designed to deliver personalized content to users. Traditionally, these systems have relied heavily on positive user feedback. However, as recommendation systems have scaled and become more sophisticated, incorporating negative preferences has emerged as a crucial step for several reasons:
1. Enhanced Personalization: By modeling user embeddings based on sequences of liked or consumed items, including negative preferences allows for more refined personalization. Negative feedback helps capture what users explicitly dislike, improving the system’s ability to tailor recommendations.
2. Content Proliferation and Cold Start Challenges: With the rapid growth of content, traditional positive-feedback-based systems often struggle to differentiate between older items with low or negative user engagement and newer items that suffer from a lack of interactions (cold start). Incorporating negative feedback helps distinguish between these two categories, ensuring better handling of both.
3. Differentiation in Similar Item Pools: When items are served from a shared pool for all users or large groups of users, negative feedback provides critical insights into user-specific preferences. This is particularly useful in capturing geographical/cultural/interest-based differences among users.
This renewed interest in leveraging negative user feedback has spurred the development of new approaches to recommendation modeling. One such development is the Not-to-Recommend (NTR) loss (https://arxiv.org/pdf/2308.12256), which explicitly incorporates negative preferences into the model. In this post, we compare this loss with the traditional Sampled Softmax (SSM) loss (https://arxiv.org/html/2201.02327v2), which treats negative-preference items simply as sampled negatives. Below, we explore their mechanisms and evaluate their performance in different scenarios.
2. Loss Functions
Sampled Softmax (SSM):
L_{SSM} = -\sum_{i \in D_{pos}} log[p(y_i | s_i) / (p(y_i | s_i) + \sum_{j \in S_i} p(j | s_i))]
where s_i is the user state for example i, y_i is the positive item, and S_i is a set of sampled negatives.
Not-to-Recommend (NTR):
L_{NTR} = -\sum_{i \in D_{pos}} log[p(y_i | s_i)] - \sum_{j \in D_{neg}} log[1 - p(y_j | s_j)]
where D_{neg} is the set of items with explicit negative feedback.
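As a rough sketch of the two objectives (plain Python for a single training example; the function names are hypothetical, and the p values are softmax probabilities over the item vocabulary):

```python
import math

def softmax_prob(logits, target):
    """Softmax probability of the target index given a list of logits."""
    denom = sum(math.exp(z) for z in logits)
    return math.exp(logits[target]) / denom

def ssm_loss(pos_logit, sampled_neg_logits):
    """Sampled-softmax loss: -log of the positive's softmax probability,
    with the denominator restricted to the positive plus sampled negatives."""
    logits = [pos_logit] + sampled_neg_logits
    return -math.log(softmax_prob(logits, 0))

def ntr_loss(pos_probs, neg_probs):
    """Not-to-Recommend loss: -log(p) for positives, -log(1 - p) for negatives."""
    return (-sum(math.log(p) for p in pos_probs)
            - sum(math.log(1.0 - p) for p in neg_probs))
```

Note the asymmetry: SSM only ever maximizes the positive's probability (negatives enter through the denominator), whereas NTR adds a term that directly drives down the probability of each explicitly disliked item.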
Next we compare how these losses differ for different types of content.
3. Comparative Analysis
Case 1: Negative items with a higher probability (e.g., because of overall popularity or frequency of occurrence) than positive items:
• For popular negative items, NTR loss imposes a significant penalty for recommending the negative item (the -log(1 - p) term is large when the item's probability p is high). This ensures the system strongly discourages recommending such items for this specific user.
• In contrast, SSM loss focuses on the relative probabilities of positive and negative items. While it penalizes the model for assigning higher probabilities to negatives, its impact is less pronounced than NTR's, particularly when the popular item also carries strong positive signal from other users.
Case 2: Cold-start items vs. negative items
NTR:
SSM:
For cold-start items, both p+ and alpha are small, so the item's relative contribution to the batch loss is minimal under either objective.
For negative-preference items, however, the NTR loss penalizes recommending the item far more heavily than SSM does (it would take a very large negative-item probability to meaningfully affect the SSM denominator), so NTR can significantly reduce the model's propensity to recommend such items in the future.
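A quick numeric illustration of this asymmetry (the probabilities below are made up for illustration):

```python
import math

# Hypothetical softmax probabilities for two problem items
p_cold = 0.001   # cold-start item: tiny probability, little feedback
p_negd = 0.5     # popular but explicitly disliked item: high probability

# NTR penalizes recommending a negative directly via -log(1 - p)
ntr_cold = -math.log(1.0 - p_cold)   # negligible penalty
ntr_negd = -math.log(1.0 - p_negd)   # large penalty, grows without bound as p -> 1

# Under SSM, the same items only enter through the softmax denominator,
# so unless their exp-logits dominate the sum, their effect on the loss
# for the positive item is muted.
```

The -log(1 - p) term explodes as p approaches 1, which is exactly the "strong discouragement" behavior described above for popular negatives, while cold-start items are left essentially untouched.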
Both Not-to-Recommend (NTR) loss and Sampled Softmax (SSM) loss have their strengths and are better suited for different scenarios in recommendation systems. Let’s examine when each might be preferable:
4. Scenarios favoring NTR Loss
• Explicit Negative Feedback: When users provide clear negative feedback (e.g., dislikes, low ratings), NTR loss can directly incorporate this information, leading to more accurate recommendations by explicitly learning what not to recommend.
• Safety-Critical Applications: In domains where recommending inappropriate items could have serious consequences (e.g., content for minors, health-related recommendations), NTR’s ability to explicitly avoid certain items is valuable.
• Diverse Recommendations: NTR can help prevent filter bubbles by actively pushing away from consistently disliked or inappropriate content, potentially leading to more diverse recommendations.
5. Scenarios favoring SSM Loss
• Large Item Catalogs: For systems with millions of items, SSM's sampling-based denominator keeps training on large-scale datasets practical and computationally efficient.
• Implicit Feedback Scenarios: When dealing primarily with implicit feedback (e.g., clicks, views), SSM can effectively learn from positive interactions without requiring explicit negative feedback.
• Ranking Optimization: SSM naturally aligns with ranking metrics, making it effective for optimizing top-K recommendations.
6. Considerations for Both
• Data Availability: NTR requires both positive and negative feedback, while SSM can work with just positive interactions. Choose based on your available data.
• Computational Resources: SSM is generally more computationally efficient, especially for large-scale systems.
• User Experience Goals: If the primary aim is to avoid bad recommendations, NTR might be preferable. If it’s to surface the best possible recommendations, SSM could be more suitable.
7. Conclusion
Both Not-to-Recommend (NTR) loss and Sampled Softmax (SSM) loss offer unique strengths tailored to different recommendation scenarios. NTR excels in explicitly handling negative feedback and avoiding undesirable recommendations, while SSM is computationally efficient and effective in ranking optimization tasks. The choice between these approaches should be guided by the specific requirements of your recommendation system—such as data availability, computational constraints, and desired user experience outcomes.
By carefully evaluating these factors, you can select the loss function that best aligns with your system’s goals and constraints.
Thanks for the summary of NTR and SSM loss functions!
There are a few technical details that might be worth more clarification:
1. Notation of the SSM loss.
Currently, the sampled softmax (SSM) loss is written as
L_{SSM} = -\sum_{i \in D_{pos}} log[p(y_i | s_i) / (p(y_i | s_i) + \sum_{j \in S_i} p(j | s_i))]
Here, it uses probability terms (e.g. p(y_i | s_i)) in the numerator and denominator of the softmax function. However, I think these probability terms should be replaced by exp(z_i) to align with the definition of the softmax formula, where z_i is the logit term (i.e. the unnormalized log probability), usually of the form z_i = s_i * v_{y_i}, which is essentially the output of the last layer of the neural network right before the softmax layer. Please see the equation in Section 3.1 of "Covington et al. Deep Neural Networks for YouTube Recommendations" for more details.
Note that this has important implications: the logit term z_i can take any real value in (-inf, +inf), and exp(z_i) lies in (0, +inf). However, p(y_i | s_i) by definition lies in [0, 1] and conveys a very different meaning than exp(z_i). Essentially, what softmax does is exponentiate the logit of each candidate class and then divide the target class's exp(z_i) by the summed exponentiated logits of all classes, which gives the target class's probability.
Now that we have the softmax probability of the target class, we can derive the softmax cross-entropy loss, which is the negative log likelihood of the softmax probability, i.e. -log(p), where p = exp(s_i*v_{y_i}) / \sum_{k \in V} exp(s_i*v_k). Summing this loss term across all training examples gives the softmax cross-entropy loss:
L = -\sum_{i \in D_{pos}} log[exp(s_i*v_{y_i}) / \sum_{k \in V} exp(s_i*v_k)]
The only difference between sampled softmax (SSM) and full softmax is that the sampled-softmax denominator has fewer terms (optionally with some additional correction factors).
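The full-vs-sampled distinction can be made concrete with a few lines of plain Python (the logit values are arbitrary, and sampling-distribution correction factors are omitted for brevity):

```python
import math

def softmax_xent(target_logit, logits):
    """Negative log of the target's softmax probability: -log(exp(z_t) / sum_k exp(z_k))."""
    denom = sum(math.exp(z) for z in logits)
    return -math.log(math.exp(target_logit) / denom)

# Full softmax: the denominator sums over the entire vocabulary V.
all_logits = [2.0, 0.5, -1.0, 0.1, 1.2]      # z_k = s_i . v_k for every class k
full = softmax_xent(2.0, all_logits)

# Sampled softmax: the denominator keeps the target plus a few sampled negatives.
sampled = softmax_xent(2.0, [2.0, 0.5, 1.2])

# With fewer terms the denominator shrinks, so the sampled loss is smaller than
# the full loss; correction factors based on the sampling distribution compensate.
```

This is why uncorrected sampled softmax is biased: dropping denominator terms can only lower the loss, and the correction factors exist to undo that bias in expectation.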
2. Relationship between SSM and NTR loss.
They're not mutually exclusive, and often used together. E.g. we can either use only SSM loss, or SSM + NTR loss.
(1) The p terms we see in the NTR loss are exactly the softmax probabilities discussed above.
(2) Since the softmax loss is just -log(p), where p is the softmax probability, the first half of the NTR loss equation (the sum over D_pos) is exactly the SSM loss. In retrieval models, this means that for positive labels we use the SSM loss (i.e. -log(p)), and for negative labels we use the second half of the NTR loss (i.e. -log(1-p)).
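In other words, a combined training loop just branches on the label sign per example. A minimal sketch (the batch values are made up; p is the item's softmax probability under the model):

```python
import math

# Hypothetical mini-batch: (softmax probability of the labeled item, label),
# where label +1 = positive interaction, -1 = explicit negative feedback.
batch = [(0.30, +1), (0.05, +1), (0.60, -1)]

loss = 0.0
for p, label in batch:
    if label > 0:
        loss += -math.log(p)        # SSM term for positive labels
    else:
        loss += -math.log(1.0 - p)  # NTR term for negative labels
```

So "SSM + NTR" is not two competing objectives but one loss whose positive half is standard sampled softmax and whose negative half adds the not-to-recommend term.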
3. The context of retrieval vs. ranking when discussing loss functions.
The paper "Learning from Negative User Feedback and Measuring Responsiveness for Sequential Recommenders" discusses the NTR loss in the context of retrieval models, where the objective is to find 1 or top-K items from a large corpus (a vocabulary of millions or more). Retrieval models typically follow a multi-class classification setup, which is where softmax cross-entropy loss naturally fits in, and NTR was also developed in this multi-class classification setup.
I haven't read "On the Effectiveness of Sampled Softmax Loss for Item Recommendation" in full detail, but it seems to be about ranking models (correct me if I'm wrong)? I think the differences between retrieval and ranking objectives should also be clarified when evaluating loss functions.