'Group time-series-like data by same values on PostgreSQL
I have a large table with structure like this (value: integer; created_at: timestamp with time zone)
| val | created_at |
|:----|:--------------------|
| 1 | 2021-01-05 13:07:23 |
| 2 | 2021-01-05 13:07:29 |
| 1 | 2021-01-05 13:07:39 |
| 1 | 2021-01-05 13:07:40 |
| 1 | 2021-01-05 13:07:41 |
| 1 | 2021-01-05 13:07:43 |
| 2 | 2021-01-05 13:07:46 |
| 2 | 2021-01-05 13:07:48 |
| 2 | 2021-01-05 13:07:52 |
| 1 | 2021-01-05 13:07:54 |
| 1 | 2021-01-05 13:07:58 |
| 1 | 2021-01-05 13:07:59 |
| 2 | 2021-01-05 13:08:01 |
| 1 | 2021-01-05 13:08:04 |
I want to group up records with same values and get lowest and highest created_at for each interval like this.
Also I need to handle intervals with only 1 record correctly (like at the start and at the end of the table)
| val | created_from | created_to |
|:----|:--------------------|:--------------------|
| 1 | 2021-01-05 13:07:23 | 2021-01-05 13:07:23 |
| 2 | 2021-01-05 13:07:29 | 2021-01-05 13:07:29 |
| 1 | 2021-01-05 13:07:39 | 2021-01-05 13:07:43 |
| 2 | 2021-01-05 13:07:46 | 2021-01-05 13:07:52 |
| 1 | 2021-01-05 13:07:54 | 2021-01-05 13:07:59 |
| 2 | 2021-01-05 13:08:01 | 2021-01-05 13:08:01 |
| 1 | 2021-01-05 13:08:04 | 2021-01-05 13:08:04 |
How can I achieve this in 1 SQL query?
Solution 1:[1]
First, we need to eliminate records in the middle of each constant interval. We get all records with different val comparing to next or prev record using lead and lag functions
select
*
from
(select
lag(val) over (order by created_at) prev_val,
val,
lead(val) over (order by created_at) next_val,
created_at
from tbl
order by created_at) t
where (
prev_val isnull
or next_val isnull
or val != next_val
or val != prev_val
)
| prev_val | val | next_val | created_at |
|:---------|:----|:---------|:--------------------|
| NULL | 1 | 2 | 2021-01-05 13:07:23 |
| 1 | 2 | 1 | 2021-01-05 13:07:29 |
| 2 | 1 | 1 | 2021-01-05 13:07:39 |
| 1 | 1 | 2 | 2021-01-05 13:07:43 |
| 1 | 2 | 2 | 2021-01-05 13:07:46 |
| 2 | 2 | 1 | 2021-01-05 13:07:52 |
| 2 | 1 | 1 | 2021-01-05 13:07:54 |
| 1 | 1 | 2 | 2021-01-05 13:07:59 |
| 1 | 2 | 1 | 2021-01-05 13:08:01 |
| 2 | 1 | NULL | 2021-01-05 13:08:04 |
So, here we have ends of each constant interval. Also we have records for single point intervals.
Next, we need to combine left and right ends of each interval.
select
val,
(case when prev_val_2 = val then prev_created_at_2 else created_at end) as created_from,
created_at as created_to
from
(select
lag(val) over (order by created_at) prev_val_2,
val,
lead(val) over (order by created_at) next_val_2,
lag(created_at) over (order by created_at) prev_created_at_2,
created_at
from
(select
lag(val) over (order by created_at) prev_val,
val,
lead(val) over (order by created_at) next_val,
created_at
from tbl
order by created_at) t
where (
(prev_val isnull
or next_val isnull
or val != next_val
or val != prev_val)
)) tt
where val != next_val_2 or next_val_2 isnull
| val | created_from | created_to |
|:----|:--------------------|:--------------------|
| 1 | 2021-01-05 13:07:23 | 2021-01-05 13:07:23 |
| 2 | 2021-01-05 13:07:29 | 2021-01-05 13:07:29 |
| 1 | 2021-01-05 13:07:39 | 2021-01-05 13:07:43 |
| 2 | 2021-01-05 13:07:46 | 2021-01-05 13:07:52 |
| 1 | 2021-01-05 13:07:54 | 2021-01-05 13:07:59 |
| 2 | 2021-01-05 13:08:01 | 2021-01-05 13:08:01 |
| 1 | 2021-01-05 13:08:04 | 2021-01-05 13:08:04 |
Explanation of (case when prev_val_2 = val then prev_created_at_2 else created_at end) expression:
We have two cases: single-point-interval like this
| prev_val | val | next_val | created_at |
|:---------|:----|:---------|:--------------------|
| NULL | 1 | 2 | 2021-01-05 13:07:23 |
| 1 | 2 | 1 | 2021-01-05 13:07:29 | <--
and many-points-interval like this
| prev_val | val | next_val | created_at |
|:---------|:----|:---------|:--------------------|
| 2 | 1 | 1 | 2021-01-05 13:07:39 | <--
| 1 | 1 | 2 | 2021-01-05 13:07:43 | <--
Condition lag(val) = val (prev_val_2 = val) stands for the second case (1=1), so here we take lag(created_at) (prev_created_at_2) as created_from: 2021-01-05 13:07:39.
Condition lag(val) != val (1!=2) stands for the first case, so
here we take created_at as both created_to and created_from: 2021-01-05 13:07:29
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
