'How to merge intervals based on a minimum gap of dates?
My current table is like the below, each patient has their visit start date and end date to a hospital, and they are administered a drug between admin_startdate and admin_enddate. For example, the first two rows mean, patient PT1 has two drug administrations, one between 01/08 and 01/10 & the other between 01/12 and 01/23, during her visit from 01/01 to 01/31.
ptid visit_start_date visit_end_date admin_startdate admin_enddate
PT1 2018-01-01 2018-01-31 2018-01-08 2018-01-10
PT1 2018-01-01 2018-01-31 2018-01-12 2018-01-23
PT2 2018-01-02 2018-01-18 2018-01-06 2018-01-11
PT2 2018-01-02 2018-01-18 2018-01-14 2018-01-17
What I would like to achieve is to lump together the drug administration that are too close together, say, the end date of the previous one is <= 2 days of the start date of new one, and call that a whole episode, like below:
ptid visit_start_date visit_end_date admin_startdate admin_enddate episode_startdate episode_enddate
PT1 2018-01-01 2018-01-31 2018-01-08 2018-01-10 2018-01-08 2018-01-23
PT1 2018-01-01 2018-01-31 2018-01-12 2018-01-23 2018-01-08 2018-01-23
PT2 2018-01-02 2018-01-18 2018-01-06 2018-01-11 2018-01-06 2018-01-11
PT2 2018-01-02 2018-01-18 2018-01-14 2018-01-17 2018-01-14 2018-01-17
You can see that PT1's two administrations are lumped together with the same episode_startdate and episode_enddate, whereas PT2's two administrations are considered two separate episode.
I have a hard time to figure out how to do it in PostgreSQL (Redshift).
Solution 1:[1]
This work in Postgres 14. Not tested for Redshift.
SELECT ptid, visit_start_date, visit_end_date, admin_startdate, admin_enddate
, min(admin_startdate) OVER (PARTITION BY visit_id, admin) AS episode_startdate
, max(admin_enddate) OVER (PARTITION BY visit_id, admin) AS episode_enddate
FROM (
SELECT *, count(*) FILTER (WHERE gap) OVER (PARTITION BY visit_id ORDER BY admin_startdate) AS admin
FROM (
SELECT *, admin_startdate - lag(admin_enddate) OVER (PARTITION BY visit_id ORDER BY admin_startdate) > 2 AS gap
FROM (
SELECT *, dense_rank() OVER (ORDER BY ptid, visit_start_date, visit_end_date) AS visit_id -- optional, to simplify
FROM tbl
) sub1
) sub2
) sub3
db<>fiddle here
The innermost subquery sub1 is only to compute a unique visit_id - which should really be in your table instead of repeating (ptid, visit_start_date, visit_end_date ) over and over. Consider normalizing your design at least that much.
The next subquery sub2 checks for a gap that's greater than two days to the previous row in the same partition.
Subquery sub3 then counts those gaps to identify distinct administration periods (admin)
In the outer SELECT, min(admin_startdate) and max(admin_enddate) per administration period produce the desired episode dates.
See (with assorted links to more):
Solution 2:[2]
CREATE TABLE tb1 AS (
SELECT *, admin_startdate - lag(admin_enddate) OVER (PARTITION BY visit_id ORDER BY admin_startdate) > 2 AS gap
FROM (
SELECT *, dense_rank() OVER (ORDER BY ptid, visit_start_date, visit_end_date) AS visit_id -- optional, to simplify
FROM tbl
) sub1
) ;
CREATE TABLE tb2 AS (
SELECT *, count(*) OVER (PARTITION BY visit_id ORDER BY admin_startdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS admin
FROM tb1
WHERE gap is True
)
;
CREATE TABLE tb3 AS (
SELECT tb1.ptid, tb1.visit_start_date, tb1.visit_end_date, tb1.admin_startdate, tb1.admin_enddate, tb1.visit_id, tb1.gap,
CASE WHEN tb2.admin is NULL THEN 0 else tb2.admin END AS admin
FROM tb1
LEFT JOIN tb2
ON tb1.ptid = tb2.ptid AND tb1.visit_start_date = tb2.visit_start_date AND tb1.visit_end_date = tb2.visit_end_date AND tb1.admin_startdate = tb2.admin_startdate AND tb1.admin_enddate = tb2.admin_enddate AND tb1.visit_id = tb2.visit_id
)
;
CREATE TABLE tb4 AS (
SELECT ptid, visit_start_date, visit_end_date, admin_startdate, admin_enddate
, min(admin_startdate) OVER (PARTITION BY visit_id, admin) AS episode_startdate
, max(admin_enddate) OVER (PARTITION BY visit_id, admin) AS episode_enddate
FROM tb3
)
This is an uglier version adapted from Erwin's answer for Redshift, which does not support FILTER operation. Tested correctly at least on db fiddle
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Erwin Brandstetter |
| Solution 2 |
