Applying a distinct operation within each RDD partition
I have some very large PySpark DataFrames with many duplicate rows. In my use case, however, the shuffle required by a full distinct() is not worth the time cost; instead I would like to apply distinct only within each partition. I can't figure out how to do this. I've tried:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('foobar').getOrCreate()
>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data)
>>> df.rdd.mapPartitions(lambda p: p.distinct()).collect()
and
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('foobar').getOrCreate()
>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data)
>>> df.foreachPartition(lambda p: p.distinct()).collect()
but in both cases I get
AttributeError: 'itertools.chain' object has no attribute 'distinct'
Does anyone have advice on how to achieve this?
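For reference, here is a sketch of one possible approach (not from the original post): mapPartitions hands each partition to your function as a plain Python iterator (hence the itertools.chain object), not an RDD, so .distinct() is not available there. You can instead deduplicate the iterator yourself, for example with a set, and rebuild a DataFrame from the result. The column names and the extra duplicate row below are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('foobar').getOrCreate()

# Hypothetical data with a duplicate row; the column names are assumed here.
data = [("Java", "20000"), ("Python", "100000"),
        ("Scala", "3000"), ("Java", "20000")]
df = spark.createDataFrame(data, ["language", "users"])

def dedupe_partition(rows):
    """Drop duplicates seen within a single partition, without any shuffle."""
    seen = set()
    for row in rows:
        key = tuple(row)  # Row objects convert cleanly to hashable tuples
        if key not in seen:
            seen.add(key)
            yield row

# mapPartitions passes dedupe_partition an iterator over one partition's rows;
# only duplicates that happen to land in the same partition are removed.
deduped = spark.createDataFrame(df.rdd.mapPartitions(dedupe_partition), df.schema)
deduped.show()
```

Note that this only removes duplicates that share a partition, which matches the stated goal of avoiding a full shuffle.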
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.
