"Pure" map-reduce shuffle in a PySpark DataFrame

Assuming I have some PySpark DataFrame, e.g.:

Key | Value
0   | "a"
2   | "c"
0   | "b"
1   | "z"

I want to perform a map-reduce-style shuffle -
i.e. I want to group rows onto partitions by key. I believe df.rdd.groupByKey() does that, but it changes the structure of the data:

  • it returns a list of tuples with a list as the value (the grouped values for each key).

How can I perform a "pure" shuffle - move my rows to specific partitions by key, but not change anything in the DataFrame's appearance / structure?

So the output would be the same, but the partitioning would be different. For example, we start with 2 partitions:

(0,"a")
(1,"c")
(1,"d")

and

(1,"d")
(0,"e")
(1,"w")

as a result we get two partitions:

(0,"a")
(0,"e")

and

(1,"d")
(1,"c")
(1,"d")
(1,"w")


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
