Scala: Window.partitionBy() and determine maximum count over joined dataframes
I have data that is being read in from two separate .csv's with the following headers:
df1:
ID1 (integer), ID2 (integer), person_count (integer), distance (float), amount (float), payment_type (integer), datetime_origin (string), datetime_dest (string)
df2:
ID1 (integer), city (string), zone (string), zone_service (string)
The goal: for each hour of the day, find the zone (where city = "LA") with the LARGEST number of pickups.
I have tried the following, but it's not yielding the results I expect, and I think that's because of my lack of understanding of how to use Window.partitionBy() correctly:
```scala
var df_zones = df2.filter($"city" === "LA").drop("zone_service")
var df_hour = df1.withColumn("hour", hour(col("datetime_origin")))
var windowSpec = Window.partitionBy("hour")
var df_Sum = df_hour.withColumn("sum", sum(col("ID1")).over(windowSpec))
var df_comb = df_zones.join(df_Sum, df_zones("ID1") === df_Sum("ID1"), "inner")
  .groupBy("zone", "hour")
  .count()
  .orderBy(desc("count"), desc("hour"))
  .show()
```
Any feedback on how to solve this? Thanks!
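One possible approach, sketched under the assumption that `df1("ID1")` is a pickup-zone ID that joins to `df2("ID1")`: summing `ID1` over a window doesn't count pickups, so instead count rows per `(hour, zone)` with `groupBy`, then use `Window.partitionBy("hour")` ordered by that count descending and keep `row_number() === 1` to get each hour's busiest zone:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// assumption: df1("ID1") is the pickup zone ID matching df2("ID1")
val zones = df2.filter($"city" === "LA")

// count pickups per (hour, zone)
val pickups = df1
  .withColumn("hour", hour(to_timestamp($"datetime_origin")))
  .join(zones, Seq("ID1"))
  .groupBy($"hour", $"zone")
  .agg(count("*").as("pickups"))

// within each hour, rank zones by pickup count and keep the top one
val byHour = Window.partitionBy($"hour").orderBy($"pickups".desc)
val busiest = pickups
  .withColumn("rn", row_number().over(byHour))
  .filter($"rn" === 1)
  .drop("rn")
  .orderBy($"hour")

busiest.show(24)
```

Note `row_number()` breaks ties arbitrarily; if ties matter, `rank()` would keep all tied zones. If `datetime_origin` is already a timestamp column, the `to_timestamp` wrapper is unnecessary.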
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow