'BigQuery DENSE_RANK

I posted something similar to this but without the full code and I don't think people understood what I was looking for...

I'm writing a query to examine bus stops, the expected time the bus was supposed to arrive, and the actual time it did (as well as some other factors, like date, trip id, line name, etc.)

What I'm running into in the data is that there are some cases where the datetime is repeated for multiple stops on the time_recorded column. I want to eliminate the lines of data where the time stamp repeats, my thought process was to use a DENSE_RANK , the code looks like this:

SELECT 
  route_id,  
  trip_id,
  trip_date,  
  aimed_arrival_time,
CASE
    when (datetime_diff(time_recorded, aimed_arrival_time, SECOND)/60) >= 1400 THEN date_add(time_recorded, INTERVAL 1 day)
    when (datetime_diff(time_recorded, aimed_arrival_time, SECOND)/60) <= -1400 THEN date_add(time_recorded, INTERVAL 1 day)
    ELSE time_recorded
    END
  AS time_recorded_adj, 
 DENSE_RANK() OVER(PARTITION BY trip_id, trip_date ORDER BY time_recorded) as stop_time_order

FROM 
  `bigquery-public-data.san_francisco_transit_muni.stop_monitoring` 
WHERE
  time_recorded >= '2020-01-01'
GROUP BY
  route_id, 
 trip_id, 
 trip_date,  
 aimed_arrival_time,  
 time_recorded
ORDER BY  
 trip_id,
 trip_date,  
 aimed_arrival_time

And it returns data like this:

trip_id expect_arrival_time recorded_arrival_time stop_ranks
10001444 2021-06-20 22:14:19 UTC 2021-06-20 22:15:52 UTC 1
10001444 2021-06-20 22:17:07 UTC 2021-06-20 22:19:52 UTC 2
10001444 2021-06-20 22:19:00 UTC 2021-06-20 22:19:52 UTC 2
10001444 2021-06-20 22:32:26 UTC 2021-06-20 22:31:59 UTC 3

You can see that for stops 2 and 3, the recorded_arrival_time is exactly the same - which is obviously impossible, as the stop is roughly 2 minutes apart.

I'm interested in taking a look at the data with the repeated lines removed - so in this example where the DENSE_RANK() returns 1 , 2 , 2 , 3, I'd like to eliminate the second line (where the recorded_arrival_time repeats) so that I'm left with this:

trip_id expect_arrival_time recorded_arrival_time stop_ranks
10001444 2021-06-20 22:14:19 UTC 2021-06-20 22:15:52 UTC 1
10001444 2021-06-20 22:17:07 UTC 2021-06-20 22:19:52 UTC 2
10001444 2021-06-20 22:32:26 UTC 2021-06-20 22:31:59 UTC 3

Any way I could do this?



Solution 1:[1]

You could use a cte from your query as basis and then Group by the column taht are equal.

Like

WITH CTE AS (SELECT 
  route_id,  
  trip_id,
  trip_date,  
  aimed_arrival_time,
CASE
    when (datetime_diff(time_recorded, aimed_arrival_time, SECOND)/60) >= 1400 THEN date_add(time_recorded, INTERVAL 1 day)
    when (datetime_diff(time_recorded, aimed_arrival_time, SECOND)/60) <= -1400 THEN date_add(time_recorded, INTERVAL 1 day)
    ELSE time_recorded
    END
  AS time_recorded_adj, 
 DENSE_RANK() OVER(PARTITION BY trip_id, trip_date ORDER BY time_recorded) as stop_time_order

FROM 
  `bigquery-public-data.san_francisco_transit_muni.stop_monitoring` 
WHERE
  time_recorded >= '2020-01-01'
GROUP BY
  route_id, 
 trip_id, 
 trip_date,  
 aimed_arrival_time,  
 time_recorded
ORDER BY  
 trip_id,
 trip_date,  
 aimed_arrival_time)
 SELECT  
    route_id, trip_id, trip_date, MIN(aimed_arrival_time), time_recorded_adj ,stop_time_order
    FROM CTE
    GROUP BY route_id, trip_id, trip_date, time_recorded_adj ,stop_time_order

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 nbk