'Reduce table records based on minimum time difference
I have a log table (MS SQL SERVER) with event entries (events are user actions like "user logged in", "user viewed entity A" etc).
Some events like "user viewed entity A" may occur multiple times within a short time frame. For instance if a user goes back and forward in his browser he may enter entity A's page multiple times within a minute, and multiple "user view" events will be logged.
For my analytics dashboard I would like to count how many times a user viewed entity A, but I would like to "debounce" the result. I want to consider multiple "user view" events close to one another as one "user view" event. Specifically, I want to consider a new "user view" event only if it is more than 30 minutes from the last one.
So having a table like this (last column is my comments for clarity):
| timestamp | evt_type | user_id | entity_id | *time diff from previous event |
|---|---|---|---|---|
| 15:30 | ENTITY_VIEW | U1 | E1 | NULL (first view) |
| 15:38 | ENTITY_VIEW | U1 | E1 | 8mins |
| 16:05 | ENTITY_VIEW | U1 | E1 | 28mins |
| 16:50 | ENTITY_VIEW | U1 | E1 | 45mins (this counts as new view) |
| 17:15 | ENTITY_VIEW | U1 | E1 | 25mins |
| 17:44 | ENTITY_VIEW | U1 | E1 | 29mins |
| 18:30 | ENTITY_VIEW | U1 | E1 | 46mins (this counts as another view) |
I would like to determine that the user "viewed" the entity 3 times.
What would be a query to determine this? I tried LEAD, LAG, PARTITION BY and other comnbinations but I don't seem to find the correct way as I am not an SQL expert.
Solution 1:[1]
Should be a simple LAG() to grab the previous timestamp and check the diff. Will say your column [timestamp] is an odd data type, what about different days? Is there a separate column for date?
Return Records >30 Minutes from Previous Record
WITH cte_DeltaSinceLastView AS (
SELECT *
/*Grab previous record for each user_id/entity_id combo*/
,PrevTimestamp = LAG([timestamp]) OVER (PARTITION BY [user_id],[entity_id] ORDER BY [timestamp])
FROM YourTable
) AS A(ID,[user_id],[entity_id],[timestamp])
)
SELECT *,MinutesSinceLastView = DATEDIFF(minute,PrevTimestamp,[Timestamp])
FROM cte_DeltaSinceLastView
WHERE DATEDIFF(minute,PrevTimestamp,[timestamp]) > 30 /*Over 30 minutes between last view*/
OR PrevTimestamp IS NULL /*First view will not have previous timestamp to compare against*/
Solution 2:[2]
Something you could try is a correlated subquery that disregards any rows that are within 30 minutes of previous rows, the remaining rows should be the ones that qualify (ie a gap of 30+ minutes exists). See if this works for you?
select Sum(vc) as ViewedCount
from (
select case when exists (
select * from t t2
where t2.timestamp > t.timestamp
and t2.evt_type = t.evt_type
and t2.user_id = t.user_id
and t2.entity_id = t.entity_id
and DateDiff(minute, t.timestamp,t2.timestamp) <30
) then 0 else 1 end vc
from t
)b;
This assumes that Timestamp is a time data type. This won't work across day boundaries but the same concept should apply with a datetime type.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stephan |
| Solution 2 |
