"Pure" map-reduce shuffle in a PySpark DataFrame
Assume I have some PySpark DataFrame, e.g.:
Key | Value
0 | "a"
2 | "c"
0 | "b"
1 | "z"
I want to perform a map-reduce-like shuffle, i.e. group rows onto partitions by key.
I believe df.rdd.groupByKey() does that, but it also changes the structure:
it returns tuples whose value is the list of all values grouped under that key.
How can I perform a "pure" shuffle - move my rows to specific partitions by key, but not change anything in the DataFrame's appearance/structure?
So the output would contain the same rows, only the partitioning would be different. For example, we start with 2 partitions:
(0,"a")
(1,"c")
(1,"d")
and
(1,"d")
(0,"e")
(1,"w")
as a result we get two partitions:
(0,"a")
(0,"e")
and
(1,"d")
(1,"c")
(1,"d")
(1,"w")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow