'BigQuery DENSE_RANK
I posted something similar to this but without the full code and I don't think people understood what I was looking for...
I'm writing a query to examine bus stops, the expected time the bus was supposed to arrive, and the actual time it did (as well as some other factors, like date, trip id, line name, etc.)
What I'm running into in the data is that there are some cases where the datetime is repeated for multiple stops on the time_recorded column. I want to eliminate the lines of data where the time stamp repeats, my thought process was to use a DENSE_RANK , the code looks like this:
SELECT
route_id,
trip_id,
trip_date,
aimed_arrival_time,
CASE
when (datetime_diff(time_recorded, aimed_arrival_time, SECOND)/60) >= 1400 THEN date_add(time_recorded, INTERVAL 1 day)
when (datetime_diff(time_recorded, aimed_arrival_time, SECOND)/60) <= -1400 THEN date_add(time_recorded, INTERVAL 1 day)
ELSE time_recorded
END
AS time_recorded_adj,
DENSE_RANK() OVER(PARTITION BY trip_id, trip_date ORDER BY time_recorded) as stop_time_order
FROM
`bigquery-public-data.san_francisco_transit_muni.stop_monitoring`
WHERE
time_recorded >= '2020-01-01'
GROUP BY
route_id,
trip_id,
trip_date,
aimed_arrival_time,
time_recorded
ORDER BY
trip_id,
trip_date,
aimed_arrival_time
And it returns data like this:
| trip_id | expect_arrival_time | recorded_arrival_time | stop_ranks |
|---|---|---|---|
| 10001444 | 2021-06-20 22:14:19 UTC | 2021-06-20 22:15:52 UTC | 1 |
| 10001444 | 2021-06-20 22:17:07 UTC | 2021-06-20 22:19:52 UTC | 2 |
| 10001444 | 2021-06-20 22:19:00 UTC | 2021-06-20 22:19:52 UTC | 2 |
| 10001444 | 2021-06-20 22:32:26 UTC | 2021-06-20 22:31:59 UTC | 3 |
You can see that for stops 2 and 3, the recorded_arrival_time is exactly the same - which is obviously impossible, as the stop is roughly 2 minutes apart.
I'm interested in taking a look at the data with the repeated lines removed - so in this example where the DENSE_RANK() returns 1 , 2 , 2 , 3, I'd like to eliminate the second line (where the recorded_arrival_time repeats) so that I'm left with this:
| trip_id | expect_arrival_time | recorded_arrival_time | stop_ranks |
|---|---|---|---|
| 10001444 | 2021-06-20 22:14:19 UTC | 2021-06-20 22:15:52 UTC | 1 |
| 10001444 | 2021-06-20 22:17:07 UTC | 2021-06-20 22:19:52 UTC | 2 |
| 10001444 | 2021-06-20 22:32:26 UTC | 2021-06-20 22:31:59 UTC | 3 |
Any way I could do this?
Solution 1:[1]
You could use a cte from your query as basis and then Group by the column taht are equal.
Like
WITH CTE AS (SELECT
route_id,
trip_id,
trip_date,
aimed_arrival_time,
CASE
when (datetime_diff(time_recorded, aimed_arrival_time, SECOND)/60) >= 1400 THEN date_add(time_recorded, INTERVAL 1 day)
when (datetime_diff(time_recorded, aimed_arrival_time, SECOND)/60) <= -1400 THEN date_add(time_recorded, INTERVAL 1 day)
ELSE time_recorded
END
AS time_recorded_adj,
DENSE_RANK() OVER(PARTITION BY trip_id, trip_date ORDER BY time_recorded) as stop_time_order
FROM
`bigquery-public-data.san_francisco_transit_muni.stop_monitoring`
WHERE
time_recorded >= '2020-01-01'
GROUP BY
route_id,
trip_id,
trip_date,
aimed_arrival_time,
time_recorded
ORDER BY
trip_id,
trip_date,
aimed_arrival_time)
SELECT
route_id, trip_id, trip_date, MIN(aimed_arrival_time), time_recorded_adj ,stop_time_order
FROM CTE
GROUP BY route_id, trip_id, trip_date, time_recorded_adj ,stop_time_order
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | nbk |
