How to find duplicates in a list?

I have a list of unsorted integers and I want to find those elements which have duplicates.

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)

I can find the distinct elements of the list with dup.distinct, so I wrote my answer as follows.

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
val distinct = dup.distinct
val elementsWithCounts = distinct.map( (a:Int) => (a, dup.count( (b:Int) => a == b )) )
val duplicatesRemoved = elementsWithCounts.filter( (pair: (Int, Int)) => { pair._2 <= 1 } )
val withDuplicates = elementsWithCounts.filter( (pair: (Int, Int)) => { pair._2 > 1 } )

Is there an easier way to solve this?



Solution 1:[1]

Try this:

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup.groupBy(identity).collect { case (x, List(_,_,_*)) => x }

The groupBy associates each distinct integer with a list of its occurrences. The collect is basically a map in which non-matching elements are ignored. The pattern following case matches integers x that are associated with a list fitting List(_,_,_*): a list with at least two elements, each represented by an underscore since we don't actually need those values, followed by zero or more further elements (_*).
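
To make the intermediate steps concrete, here is a small sketch of what groupBy and collect produce on the list from the question (results shown in comments; Map iteration order is unspecified, so the final result is compared as a Set):

```scala
val dup = List(1, 1, 1, 2, 3, 4, 5, 5, 6, 100, 101, 101, 102)

// groupBy(identity) maps each distinct value to the list of its occurrences:
val grouped = dup.groupBy(identity)
// e.g. grouped(1) == List(1, 1, 1) and grouped(2) == List(2)

// collect keeps only the keys whose occurrence list matches List(_, _, _*),
// i.e. lists with at least two elements:
val dups = grouped.collect { case (x, List(_, _, _*)) => x }.toSet
// dups == Set(1, 5, 101)
```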

You could also do:

dup.groupBy(identity).collect { case (x,ys) if ys.lengthCompare(1) > 0 => x }

It's much faster than the approach you provided since it doesn't have to repeatedly pass over the data.

Solution 2:[2]

A bit late to the party, but here's another approach:

dup.diff(dup.distinct).distinct

diff removes one occurrence of each element of its argument (dup.distinct) from the receiver, so what remains are exactly the surplus occurrences, i.e. the duplicates.
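
A short sketch of the mechanics on the list from the question:

```scala
val dup = List(1, 1, 1, 2, 3, 4, 5, 5, 6, 100, 101, 101, 102)

// dup.distinct keeps one occurrence of each value; diff then removes exactly
// one occurrence per element of its argument, so only the extras remain:
val extras = dup.diff(dup.distinct) // List(1, 1, 5, 101)
val dups = extras.distinct          // List(1, 5, 101)
```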

Solution 3:[3]

Another approach is to use foldLeft and do it the hard way.

We start with two empty sets: one for elements that we have seen at least once, the other for elements that we have seen at least twice (i.e. the duplicates).

We traverse the list. When the current element has already been seen (seen(cur)), it is a duplicate and is added to duplicates; otherwise we add it to seen. The result is the second set, which contains the duplicates.

Written as a generic method:

def dups[T](list: List[T]) = list.foldLeft((Set.empty[T], Set.empty[T])){ case ((seen, duplicates), cur) => 
  if(seen(cur)) (seen, duplicates + cur) else (seen + cur, duplicates)      
}._2

val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)

dups(dup) //Set(1,5,101)

Solution 4:[4]

Summary: I've written a function which, in a single pass, returns both the equivalent of List.distinct and a List pairing each element that appears more than once with the index at which the duplicate occurrence appeared.

Details: If you need a bit more information about the duplicates themselves, as I did, this more general function iterates across a List exactly once (ordering was significant in my case) and returns a Tuple2: the original List deduped (all occurrences after the first are removed, i.e. the same result as invoking distinct) and a second List pairing each duplicate with the Int index at which it occurred within the original List.

I have implemented the function twice, based on the general performance characteristics of the Scala collections: filterDupesL (L for Linear) and filterDupesEc (Ec for Effectively Constant).

Here's the "Linear" function:

def filterDupesL[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(
      remaining: List[A]
    , index: Int =
        0
    , accumulator: (List[A], List[(A, Int)]) =
        (Nil, Nil)): (List[A], List[(A, Int)]
  ) =
    if (remaining.isEmpty)
      accumulator
    else
      recursive(
          remaining.tail
        , index + 1
        , if (accumulator._1.contains(remaining.head)) //contains is linear
          (accumulator._1, (remaining.head, index) :: accumulator._2)
        else
          (remaining.head :: accumulator._1, accumulator._2)
      )
  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}

Below is an example which might make it a bit more intuitive. Given this List of String values:

val withDupes =
  List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")

...and then performing the following:

val (deduped, dupeAndIndexes) =
  filterDupesL(withDupes)

...the results are:

deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexes: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))

And if you just want the duplicates, you simply map across dupeAndIndexes and invoke distinct:

val dupesOnly =
  dupeAndIndexes.map(_._1).distinct

...or all in a single call:

val dupesOnly =
  filterDupesL(withDupes)._2.map(_._1).distinct

...or if a Set is preferred, skip distinct and invoke toSet...

val dupesOnly2 =
  dupeAndIndexes.map(_._1).toSet

...or all in a single call:

val dupesOnly2 =
  filterDupesL(withDupes)._2.map(_._1).toSet

For very large Lists, consider this more efficient version, which uses an additional Set to make the contains check effectively constant time.

Here's the "Effectively Constant" function:

def filterDupesEc[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(
      remaining: List[A]
    , index: Int =
        0
    , seenAs: Set[A] =
        Set()
    , accumulator: (List[A], List[(A, Int)]) =
        (Nil, Nil)): (List[A], List[(A, Int)]
  ) =
    if (remaining.isEmpty)
      accumulator
    else {
      val (isInSeenAs, seenAsNext) = {
        val isInSeenA =
          seenAs.contains(remaining.head) //contains is effectively constant
        (
            isInSeenA
          , if (!isInSeenA)
              seenAs + remaining.head
            else
              seenAs
        )
      }
      recursive(
          remaining.tail
        , index + 1
        , seenAsNext
        , if (isInSeenAs)
          (accumulator._1, (remaining.head, index) :: accumulator._2)
        else
          (remaining.head :: accumulator._1, accumulator._2)
      )
    }
  val (distinct, dupes) = recursive(items)
  (distinct.reverse, dupes.reverse)
}

Both of the above functions are adaptations of the filterDupes function in my open source Scala library, ScalaOlio. It's located at org.scalaolio.collection.immutable.List_._.

Solution 5:[5]

def findDuplicates[T](seq: Seq[T]): Seq[T] = {
  seq.groupMapReduce(identity)(_ => false)((_, _) => true).filter(_._2).keys.toSeq
}

We start by associating every element with false (the map phase). Whenever two entries for the same key are combined, the element must be a duplicate, so the reduction produces true (the reduce phase); the reduce phase only runs for keys that occur more than once. Finally we keep only the keys associated with true (the filter phase).
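
A minimal sketch of the three phases on a tiny input (groupMapReduce requires Scala 2.13+):

```scala
// map phase: each element is mapped to false; reduce phase: combining two
// entries for the same key yields true, and it only fires for duplicated keys
val marked = List(1, 1, 2).groupMapReduce(identity)(_ => false)((_, _) => true)
// marked == Map(1 -> true, 2 -> false)

// filter phase: keep only the keys marked true
val dups = marked.filter(_._2).keys.toSeq // Seq(1)
```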

This works with all implementations of trait Seq.

The time and space complexity are both O(n).

Comparing with other answers:

  1. @dhg's answer does not work with all kinds of sequences. E.g. it works with List but produces incorrect results for Array (the List pattern match will not match an Array). Also, although it has the same time and space complexity, it builds, for each distinct element, a list containing all of that element's occurrences, only to check that the list has more than one element, so the memory footprint and runtime will be higher.
  2. @Luigi Plinge's answer is nice and elegant. However, it creates an intermediate hashmap-like data structure three times (once for diff and twice for distinct), so the runtime is likely to be higher.

Solution 6:[6]

I'm joining the one-liner party.

How about this (the list must be sorted first so that equal elements are adjacent, and distinct removes the repeats produced by elements occurring more than twice):

val sorted = myDupList.sorted
(sorted zip sorted.tail).collect { case (m, n) if m == n => m }.distinct

Solution 7:[7]

My favorite is

def hasDuplicates(in: List[Int]): Boolean = {
  val sorted = in.sortWith((i, j) => i < j)
  val zipped = sorted.drop(1).zip(sorted) // drop(1) rather than tail, so an empty list is handled
  zipped.exists(p => p._1 == p._2)
}
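
The trick here: sorting brings equal elements next to each other, and zipping the sorted list against itself shifted by one compares each element with its neighbor. A quick sketch on a small hypothetical input:

```scala
val in = List(5, 1, 5, 3)
val sorted = in.sortWith((i, j) => i < j) // List(1, 3, 5, 5)
val zipped = sorted.drop(1).zip(sorted)   // pairs each element with its predecessor
// zipped == List((3,1), (5,3), (5,5))
val hasDup = zipped.exists(p => p._1 == p._2) // true, because 5 occurs twice
```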

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Luigi Plinge
Solution 3 Kigyo
Solution 4
Solution 5 SlavaSt
Solution 6 Rohlik
Solution 7 JNYRanger