'SQL BigQuery - COUNTIF with criteria from current row and partitioned rows
I'm running this line of code:
COUNTIF(
type = "credit"
AND
DATETIME_DIFF(credit_window_end, start_at_local_true_01, DAY) BETWEEN 0 and 5
)
over (partition by case_id order by start_at_local_true_01
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
as credit_count_per_case_id_in_future_and_within_credit_window,
And I'm getting this
| Row | case_id | start_at_local_true_01 | type | credit_window_end | credit_count_per_case_id_in_future_and_within_credit_window |
|---|---|---|---|---|---|
| 1 | 12123 | 2022-02-01 11:00:00 | null | 2022-02-06 11:00:00 | 0 |
| 2 | 12123 | 2022-02-01 11:15:00 | run | null | 0 |
| 3 | 12123 | 2022-02-01 11:21:00 | jump | 2022-02-06 11:21:00 | 0 |
| 4 | 12123 | 2022-02-04 11:31:00 | run | 2022-02-09 11:31:00 | 0 |
| 5 | 12123 | 2022-02-05 11:34:00 | jump | null | 0 |
| 6 | 12123 | 2022-02-08 12:38:00 | credit | null | 0 |
| 7 | 12555 | 2022-02-01 11:15:00 | null | null | 0 |
But I want this
| Row | case_id | start_at_local_true_01 | type | credit_window_end | credit_count_per_case_id_in_future_and_within_credit_window |
|---|---|---|---|---|---|
| 1 | 12123 | 2022-02-01 11:00:00 | null | 2022-02-06 11:00:00 | 0 |
| 2 | 12123 | 2022-02-01 11:15:00 | run | null | 0 |
| 3 | 12123 | 2022-02-01 11:21:00 | jump | 2022-02-06 11:21:00 | 0 |
| 4 | 12123 | 2022-02-04 11:31:00 | run | 2022-02-09 11:31:00 | 1 |
| 5 | 12123 | 2022-02-05 11:34:00 | jump | null | 0 |
| 6 | 12123 | 2022-02-08 12:38:00 | credit | null | 0 |
| 7 | 12555 | 2022-02-01 11:15:00 | null | null | 0 |
The 4th row should be 1 because (from the 6th row) credit = credit AND DATETIMEDIFF(2022-02-08T12:38:00, 2022-02-04 11:31:00, DAY) between 0 and 5
The calculation within the cell would look like this:
COUNTIF(
run = credit AND DATETIMEDIFF(2022-02-04 11:31:00, 2022-02-04T11:31:00, DAY ) between 0 and 5
jump = credit AND DATETIMEDIFF(2022-02-04 11:31:00, 2022-02-05T11:34:00, DAY ) between 0 and 5
credit = credit AND DATETIMEDIFF(2022-02-04 11:31:00, 2022-02-08T12:38:00, DAY ) between 0 and 5
)
COUNTIF(
false and false
false and false
true and true
)
COUNTIF(
0
0
1
)
I think I know why, but I don't know how to fix it.
It's because the DATETIME_DIFF function is taking both values from the same row (from each partitioned row). The second element should stay the same (start_at_local_true_01). But I want the first element to be fixed to the CURRENT ROW's credit_window_end (not each partitioned row's credit_window_end).
This is my code so far (including sample table):
with data_table as(
select * FROM UNNEST(ARRAY<STRUCT<
case_id INT64, start_at_local_true_01 DATETIME, type STRING, credit_window_end DATETIME>>
[
(12123, DATETIME("2022-02-01 11:00:00"), null, DATETIME("2022-02-06 11:00:00"))
,(12123, DATETIME("2022-02-01 11:15:00"), 'run', null)
,(12123, DATETIME("2022-02-01 11:21:00"), 'jump', DATETIME("2022-02-06 11:21:00"))
,(12123, DATETIME("2022-02-04 11:31:00"), 'run', DATETIME("2022-02-09 11:31:00"))
,(12123, DATETIME("2022-02-05 11:34:00"), 'jump', null)
,(12123, DATETIME("2022-02-08 12:38:00"), 'credit', null)
,(12555, DATETIME("2022-02-01 11:15:00"), null, null)
]
)
)
select
data_table.*,
COUNTIF(
type = "credit"
)
over (partition by case_id order by start_at_local_true_01
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
as credit_count_per_case_id_in_future,
COUNTIF(
type = "credit"
AND
DATETIME_DIFF(start_at_local_true_01, credit_window_end, DAY) BETWEEN 0 and 5
)
over (partition by case_id order by start_at_local_true_01
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
as credit_count_per_case_id_in_future_and_within_credit_window,
--does not work. does not even run
-- DATETIME_DIFF(
-- credit_window_end,
-- array_agg(
-- IFNULL(start_at_local_true_01,DATETIME("2000-01-01 00:00:00"))
-- )
-- over (partition by case_id order by start_at_local_true_01 asc
-- ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
-- , DAY
-- )
-- as credit_count_per_case_id_in_future_and_within_credit_window_02,
from data_table
Thanks for the help!
Solution 1:[1]
As confirmed by @Phil in the comments, this was solved by changing the window to:
over (partition by case_id order by UNIX_MILLIS(TIMESTAMP(start_at_local_true_01)) RANGE BETWEEN CURRENT ROW AND 432000000 FOLLOWING)
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
