Applying a distinct operation within each RDD partition
I have some very large PySpark DataFrames with many duplicate rows. In my use case, however, the shuffle required by a full distinct() is not worth the time cost; instead I would like to apply distinct only within each partition. I can't figure out how to do this. I've tried:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('foobar').getOrCreate()
>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data)
>>> df.rdd.mapPartitions(lambda p: p.distinct()).collect()
and
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('foobar').getOrCreate()
>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data)
>>> df.foreachPartition(lambda p: p.distinct()).collect()
but in both cases I get
AttributeError: 'itertools.chain' object has no attribute 'distinct'
Does anyone have advice on how to achieve this?
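For reference, here is a sketch of one possible approach (not from the original post): mapPartitions hands each partition to your function as a plain Python iterator (hence the itertools.chain object), not an RDD, so .distinct() is not available there. You can instead deduplicate the iterator yourself, for example with a set, and rebuild a DataFrame from the result. The column names and the extra duplicate row below are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('foobar').getOrCreate()

# Hypothetical data with a duplicate row; the column names are assumed here.
data = [("Java", "20000"), ("Python", "100000"),
        ("Scala", "3000"), ("Java", "20000")]
df = spark.createDataFrame(data, ["language", "users"])

def dedupe_partition(rows):
    """Drop duplicates seen within a single partition, without any shuffle."""
    seen = set()
    for row in rows:
        key = tuple(row)  # Row objects convert cleanly to hashable tuples
        if key not in seen:
            seen.add(key)
            yield row

# mapPartitions passes dedupe_partition an iterator over one partition's rows;
# only duplicates that happen to land in the same partition are removed.
deduped = spark.createDataFrame(df.rdd.mapPartitions(dedupe_partition), df.schema)
deduped.show()
```

Note that this only removes duplicates that share a partition, which matches the stated goal of avoiding a full shuffle.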
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.
