Create multiple columns over the same window
The following code is pretty slow.
Is there a way of creating multiple columns at once over the same window, so Spark does not need to partition and order the data multiple times?
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window().partitionBy("k").orderBy("t")

# last(..., True) means ignorenulls=True: forward-fill with the last non-null value
df = df.withColumn("a", F.last("a", True).over(w))
df = df.withColumn("b", F.last("b", True).over(w))
df = df.withColumn("c", F.last("c", True).over(w))
...
Solution 1:[1]
I'm not sure that Spark partitions and re-sorts the data several times when the same window specification is reused consecutively. However, a single .select is usually a better alternative than a chain of .withColumn calls, since each .withColumn adds another projection to the plan.
df = df.select(
    # exclude the original a/b/c, otherwise "*" plus the aliases below
    # would produce duplicate column names
    *[c for c in df.columns if c not in {"a", "b", "c"}],
    F.last("a", True).over(w).alias("a"),
    F.last("b", True).over(w).alias("b"),
    F.last("c", True).over(w).alias("c"),
)
To find out whether partitioning and ordering are done several times, analyse the df.explain() output.
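For illustration, here is a minimal, self-contained sketch (the SparkSession setup and the toy data are invented for this example). With one shared window specification, the physical plan should show a single Window operator preceded by one Exchange and one Sort, rather than one set per column:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data shaped like the question: key k, ordering column t, values a/b/c.
df = spark.createDataFrame(
    [("k1", 1, "a1", None, None),
     ("k1", 2, None, "b2", None),
     ("k1", 3, None, None, "c3")],
    "k string, t int, a string, b string, c string",
)

w = Window.partitionBy("k").orderBy("t")
out = df.select(
    "k", "t",
    *[F.last(c, True).over(w).alias(c) for c in ["a", "b", "c"]],
)

# Look for a single Window operator in the physical plan; several Exchange/Sort
# pairs would indicate the data is repartitioned and re-sorted per column.
out.explain()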
Solution 2:[2]
You don't have to generate one column at a time; use a list comprehension. Code below:
new = ["a", "b", "c"]
df = df.select(
    # keep every column except the originals being replaced
    *[c for c in df.columns if c not in new],
    *[F.last(x, True).over(w).alias(x) for x in new],
)
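If you are on Spark 3.3 or later (an assumption about your environment), DataFrame.withColumns, the plural form added in 3.3, applies all the replacements in a single projection, so the shared window is still only partitioned and sorted once:

# Spark 3.3+ only: all three columns are replaced in one projection
new = ["a", "b", "c"]
df = df.withColumns({x: F.last(x, True).over(w) for x in new})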
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ZygD |
| Solution 2 | wwnde |
