'Using annotate and distinct(field) together in Django

I've got a bunch of reviews in my app. Users are able to "like" reviews.

I'm trying to get the most liked reviews. However, there are some popular users on the app, and all their reviews have the most likes. I want to only select one review (ideally the most liked one) per user.

Here are my objects,

class Review(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE, related_name='review_user', db_index=True)
    review_text = models.TextField(max_length=5000)
    rating = models.SmallIntegerField(
        validators=[
            MaxValueValidator(10),
            MinValueValidator(1),
        ],
    )
    date_added = models.DateTimeField(db_index=True)
    review_id = models.AutoField(primary_key=True, db_index=True)

class LikeReview(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE, related_name='likereview_user', db_index=True)
    review = models.ForeignKey(Review, on_delete=models.CASCADE, related_name='likereview_review', db_index=True)
    date_added = models.DateTimeField()

    class Meta:
        unique_together = [['user', 'review']]

And here's what I currently have to get the most liked reviews:

reviews = Review.objects.filter().annotate(
    num_likes=Count('likereview_review')
).order_by('-num_likes').distinct()

As you can see, the reviews I get will be sorted by the most likes, but its possible that the top liked reviews are all by the same user. I want to add distinct('user') here but I get annotate() + distinct(fields) is not implemented.

How can I accomplish this?



Solution 1:[1]

This will be a bit badly readable because of your related names. I would suggest to change Review.user.related_name to reviews, it will make this much more understandable, but I've elaborated on that in the second part of the answer.

With your current setup, I managed to do it fully in the DB using subqueries:

from django.db.models import Subquery, OuterRef, Count

# No DB Queries
best_reviews_per_user = Review.objects.all()\
    .annotate(num_likes=Count('likereview_review'))\
    .order_by('-num_likes')\
    .filter(user=OuterRef('id'))

# No DB Queries
review_sq = Subquery(best_reviews_per_user.values('review_id')[:1])

# First DB Query
best_review_ids = User.objects.all()\
    .annotate(best_review_id=review_sq)\
    .values_list('best_review_id', flat=True)

# Second DB Query
best_reviews = Review.objects.all()\
    .annotate(num_likes=Count('likereview_review'))\
    .order_by('-num_likes')\
    .filter(review_id__in=best_review_ids)\
    .exclude(num_likes=0)  # I assume this is the case


# Print it
for review in best_reviews:
    print(review, review.num_likes, review.user)

# Test it
assert len({review.user for review in best_reviews}) == len(best_reviews)
assert sorted([r.num_likes for r in best_reviews], reverse=True) == [r.num_likes for r in best_reviews]
assert all([r.num_likes for r in best_reviews])

Let's try with this completely equivalent model structure:

from django.db import models
from django.utils import timezone


class TimestampedModel(models.Model):
    """This makes your life much easier and is pretty DRY"""
    created = models.DateTimeField(default=timezone.now)
    class Meta:
        abstract = True


class Review(TimestampedModel):
    user = models.ForeignKey(User, on_delete=models.CASCADE, related_name='reviews', db_index=True)
    text = models.TextField(max_length=5000)
    rating = models.SmallIntegerField()
    likes = models.ManyToManyField(User, through='ReviewLike')


class ReviewLike(TimestampedModel):
    user = models.ForeignKey(User, on_delete=models.CASCADE, db_index=True)
    review = models.ForeignKey(Review, on_delete=models.CASCADE, db_index=True)

The likes are a clear m2m relationship between reviews and users, with an extra timestamp column - it's a model use for a Through model. Docs here.

Now everything is imho much much easier to read.

from django.db.models import OuterRef, Count, Subquery


# No DB Queries
best_reviews = Review.objects.all()\
    .annotate(like_count=Count('likes'))\
    .exclude(like_count=0)\
    .order_by('-like_count')\

# No DB Queries
sq = Subquery(best_reviews.filter(user=OuterRef('id')).values('id')[:1])

# First DB Query
user_distinct_best_review_ids = User.objects.all()\
    .annotate(best_review=sq)\
    .values_list('best_review', flat=True)

# Second DB Query
best_reviews = best_reviews.filter(id__in=user_distinct_best_review_ids).all()

Solution 2:[2]

One way of doing it is as follows:

  1. Get a list of tuples that represent the user.id and review.id, ordered by user and number of likes ASCENDING
  2. Convert the list to a dict to remove duplicate user.ids. Later items replace earlier ones, which is why the ordering in step 1 is important
  3. Create a list of review.ids from the values in the dict
  4. Get a queryset using the list of review.ids, ordered by the number of likes DESCENDING
from django.db.models import Count

user_review_list = Review.objects\
    .annotate(num_likes=Count('likereview_review'))\
    .order_by('user', 'num_likes')\
    .values_list('user', 'pk')

user_review_dict = dict(user_review_list)
review_pk_list = list(user_review_dict.values())

reviews = Review.objects\
    .annotate(num_likes=Count('likereview_review'))\
    .filter(pk__in=review_pk_list)\
    .order_by('-num_likes')

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2