'Is there an efficient way to calculate number of rows since a change in a categorical variable?
I have a table that records a daily state for a user. I would like to calculate, for every row, the number of rows (base 1) the user has been in the same state.
I have a workable solution, but I suspect it is not very performant given the use of three window functions. This script is run over a table of c.50million rows and 10 columns.
Please could someone recommend a more efficient means of doing the following?
Current Solution
-- return the state in the previous day for a user
with lagged_data as (
select
user_id
,day
,state_end_of_day
,lag(state_end_of_day) over (partition by user_id order by day) as state_end_of_day_lag
,case when state_end_of_day_lag = state_end_of_day then 0 else 1 end as state_change_flag
from
stg_user_states_daily
)
-- use cumulative sum of the change flag to assign groups for each period in a state
, state_group_index as (
select
*
, sum(state_change_flag) over (partition by user_id order by day rows between unbounded preceding and current row) as state_group_index
from
lagged_data
)
-- count number of days user has been in current state on each day
, days_in_state as (
select
*
,sum(1) over (partition by user_id, state_group_index order by day rows between unbounded preceding and current row) as days_in_current_state
from
state_group_index
)
select
user_id
,day
,state_end_of_day
,days_in_current_state
from
days_in_state
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
