'postgresql: define query to get all first occurring events pr group

I have the following data structure:

CREATE TABLE test_table(
 id int8 PRIMARY KEY,
 patientid VARCHAR(255) NOT NULL,
 studyid VARCHAR(255) NOT NULL,
 gid VARCHAR(255) NOT NULL,
 labels int8,
 timestamp TIMESTAMP
 );

With the following data:

INSERT INTO test_table(id, patientid, studyid, gid, labels, timestamp)
VALUES (1, 1, 1, 1, 0, '2015-01-11 00:51:14'),
        (4, 1, 1, 1, 1, '2015-01-11 00:54:14'),
        (3, 1, 1, 1, 3, '2015-01-11 00:53:14'),
        (2, 1, 1, 1, 0, '2015-01-11 00:52:14'),
        (5, 1, 1, 1, 0, '2015-01-11 00:55:14'),
        (6, 1, 2, 2, 1, '2015-01-12 08:10:14'),
        (7, 1, 2, 2, 0, '2015-01-12 08:11:14'),
        (8, 1, 2, 2, 1, '2015-01-12 08:13:25'),
        (9, 2, 1, 3, 0, '2015-01-12 09:14:25'),
        (10, 2, 1, 3, 0, '2015-01-12 09:15:25'),
        (11, 2, 1, 3, 2, '2015-01-12 09:17:25'),
        (12, 2, 1, 3, 6, '2015-01-12 09:18:25'),
        (13, 3, 1, 4, 0, '2015-01-13 07:14:25'),
        (14, 3, 1, 4, 0, '2015-01-13 07:15:25'),
        (15, 3, 1, 4, 1, '2015-01-13 07:17:25'),
        (16, 3, 1, 4, 0, '2015-01-13 07:18:25'),
        (17, 3, 1, 4, 0, '2015-01-13 07:18:25'),
        (18, 4, 1, 5, 0, '2015-01-13 03:14:25'),
        (19, 4, 1, 5, 0, '2015-01-13 03:15:25'),
        (20, 4, 1, 5, 0, '2015-01-13 03:17:25'),
        (21, 4, 1, 5, 1, '2015-01-13 03:18:25'),
        (22, 4, 2, 6, 0, '2015-01-13 03:12:13'),
        (23, 4, 2, 6, 1, '2015-01-13 03:14:07'),
        (24, 4, 2, 6, 1, '2015-01-13 03:17:05'),
        (25, 4, 2, 6, 0, '2015-01-13 03:19:42'),
        (26, 4, 2, 6, 0, '2015-01-13 03:22:51')
;

I seek to retrieve the first occurring rows with labels=0 grouped by the combination of patientid and studyid.

The desired output contains the the following row ids: (1, 2, 7, 9, 10, 13, 14, 18, 19, 20, 22)

I have tried many things, but nothing seems viable. As an example I have tried the following (special thanks for help in this thread):

select rr.* from (
select tt.*,
       -- rank by group and sort
       rank() over (partition by patientid, studyid order by timestamp) rank
from test_table tt where labels = 0) rr
where rr.rank <= 20;

Unfortunately, the query above selects all 16 rows with labels=0. The number 20 is just an upper bound for how many occurrences of the labels=0 I suspect is possible.

I hope someone has the solution and an explanation on how to get the desired output.



Solution 1:[1]

select rr.* from (
select tt.*,
       -- rank by group and sort
       rank() over (partition by patientid, studyid order by timestamp) rank
from test_table tt where labels = 0) rr
-- you wanted only the first row in your desc; rank=1
-- your test output is rank <=2
where rr.rank <= 2;

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 PrasadU