Scala: Window.partitionBy() and determine maximum count over joined dataframes
I have data that is being read in from two separate .csv's with the following headers:
df1:
ID1 (integer), ID2 (integer), person_count (integer), distance (float), amount (float), payment_type (integer), datetime_origin (string), datetime_dest (string)
df2:
ID1 (integer), city (string), zone (string), zone_service (string)
The goal: for each hour of the day, find the zone (where city = "LA") with the LARGEST number of pickups.
I have tried the following, but it's not yielding the results I expect, and I think that's because of my lack of understanding of how to use Window.partitionBy() correctly:
```scala
var df_zones = df2.filter($"city" === "LA").drop("zone_service")
var df_hour = df1.withColumn("hour", hour(col("datetime_origin")))
var windowSpec = Window.partitionBy("hour")
var df_Sum = df_hour.withColumn("sum", sum(col("ID1")).over(windowSpec))
var df_comb = df_zones.join(df_Sum, df_zones("ID1") === df_Sum("ID1"), "inner")
  .groupBy("zone", "hour")
  .count()
  .orderBy(desc("count"), desc("hour"))
  .show()
```
Any feedback on how to solve this? Thanks!
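One possible approach, sketched under the assumption that `df1("ID1")` is a pickup-zone ID that joins to `df2("ID1")`: summing `ID1` over a window doesn't count pickups, so instead count rows per `(hour, zone)` with `groupBy`, then use `Window.partitionBy("hour")` ordered by that count descending and keep `row_number() === 1` to get each hour's busiest zone:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// assumption: df1("ID1") is the pickup zone ID matching df2("ID1")
val zones = df2.filter($"city" === "LA")

// count pickups per (hour, zone)
val pickups = df1
  .withColumn("hour", hour(to_timestamp($"datetime_origin")))
  .join(zones, Seq("ID1"))
  .groupBy($"hour", $"zone")
  .agg(count("*").as("pickups"))

// within each hour, rank zones by pickup count and keep the top one
val byHour = Window.partitionBy($"hour").orderBy($"pickups".desc)
val busiest = pickups
  .withColumn("rn", row_number().over(byHour))
  .filter($"rn" === 1)
  .drop("rn")
  .orderBy($"hour")

busiest.show(24)
```

Note `row_number()` breaks ties arbitrarily; if ties matter, `rank()` would keep all tied zones. If `datetime_origin` is already a timestamp column, the `to_timestamp` wrapper is unnecessary.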
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow