'Full outer join not giving the answer I need
I am using PostgreSQL and am having difficulty with getting a series of queries that combine the data from two tables (t1, t2)
t1 is
| studyida | gender | age |
|---|---|---|
| a | M | 1 |
| a | M | 2 |
| a | M | 3 |
| b | F | 4 |
| b | F | 5 |
| b | F | 6 |
| c | M | 13 |
| c | M | 14 |
| c | M | 15 |
and t2 is
| studyida | studyidb | gender | age |
|---|---|---|---|
| a | z | M | 3 |
| a | z | M | 4 |
| a | z | M | 5 |
| NULL | y | F | 7 |
| NULL | y | F | 8 |
| NULL | y | F | 9 |
| c | x | M | 10 |
| c | x | M | 11 |
| c | x | M | 12 |
| NULL | w | F | 7 |
| NULL | w | F | 8 |
| NULL | w | F | 9 |
| NULL | u | M | 7 |
| NULL | u | M | 8 |
| NULL | u | M | 9 |
t1 and t2 are related via StudyIDA and gender. What I need is a comprehensive listing from both tables, including including the ages. Sometimes the age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3) but most of the time it does not.
I am looking to create a table like this
| StudyIDA | StudyIDB | gender | ageA | ageB |
|---|---|---|---|---|
| a | z | M | 1 | |
| a | z | M | 2 | |
| a | z | M | 3 | 3 |
| a | z | M | 4 | |
| a | z | M | 5 | |
| b | NULL | F | 4 | |
| b | NULL | F | 5 | |
| b | NULL | F | 6 | |
| NULL | y | F | 7 | |
| NULL | y | F | 8 | |
| NULL | y | F | 9 | |
| c | x | F | 13 | |
| c | x | F | 14 | |
| c | x | F | 15 | |
| c | x | F | 10 | |
| c | x | F | 11 | |
| c | x | F | 12 | |
| NULL | w | F | 7 | |
| NULL | w | F | 8 | |
| NULL | w | F | 9 | |
| NULL | u | M | 7 | |
| NULL | u | M | 8 | |
| NULL | u | M | 9 |
I was thinking that first a full outer join of t1 and t2 would give me what I want but it does not.
Then I was thinking I need a listing of all the individuals (lets call it t3), and then do a series of inserts (e.g. t1+t3 and also t1+t3) into a new table to 'construct' what I need. I am really stuck on the odd times when age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3).
I am still not getting what I need. Here is my code so far
DROP TABLE IF EXISTS t1, t2, t3;
CREATE TEMPORARY TABLE t1 (StudyIDA VARCHAR, gender VARCHAR, age int);
INSERT INTO t1 VALUES
('a','M', 1),('a','M', 2),('a','M', 3),
('b','F', 4),('b','F', 5),('b','F', 6),
('c','M', 13),('c','M', 14),('c','M', 15);
SELECT * FROM t1;
CREATE TEMPORARY TABLE t2 (StudyIDA VARCHAR, StudyIDB varchar, gender VARCHAR, age int);
INSERT INTO t2 VALUES
('a','z','M', 3), ('a','z','M', 4), ('a','z','M', 5),
(NULL,'y','F', 7),(NULL,'y','F', 8),(NULL,'y','F', 9),
('c','x','M', 10),('c','x','M', 11),('c','x','M', 12),
(NULL,'w','F', 7),(NULL,'w','F', 8),(NULL,'w','F', 9),
(NULL,'u','M', 7),(NULL,'u','M', 8),(NULL,'u','M', 9);
SELECT * FROM t2;
CREATE TEMPORARY TABLE t3 (StudyIDA_t1 VARCHAR, gender_t1 VARCHAR, StudyIDA_t2 VARCHAR,StudyIDB varchar,
gender_t2 VARCHAR);
INSERT INTO t3
SELECT * FROM (SELECT DISTINCT StudyIDA, gender FROM t1) a FULL OUTER JOIN
(SELECT DISTINCT StudyIDA, StudyIDB, gender FROM t2) b
ON a.StudyIDA=b.StudyIDA AND a.gender=b.gender
ORDER BY a.StudyIDA;
SELECT * FROM t3 ORDER BY StudyIDA_t1;
SELECT 'IN t1', *
FROM t3 JOIN t1 on t1.StudyIDA=t3.StudyIDA_t1 AND t1.gender=t3.gender_t1
ORDER BY StudyIDA_t1, StudyIDB;
SELECT 'In t2',*
FROM t3 JOIN t2 on t3.StudyIDA_t1=t2.StudyIDA AND t3.gender_t1=t2.gender
ORDER BY StudyIDA_t1, t3.StudyIDB;
DROP TABLE IF EXISTS t1, t2, t3;
Solution 1:[1]
Your sample data indicates that only t2.studyida can be NULL and all other columns should really be declared as NOT NULL.
If so, I suggest this simpler query:
SELECT studyida, b.studyidb, gender, age
, CASE WHEN a.age IS NULL THEN 'b'
WHEN b.age IS NULL THEN 'a'
ELSE 'a and b' END as source
FROM t1 a
FULL JOIN t2 b USING (studyida, gender, age)
ORDER BY studyida, gender, age;
db<>fiddle here
The USING clause is convenient for identically named join columns. Only a single instance of the joining column is in the result set, effectively what COALESCE(a.col, b.col) gives you otherwise. (You might just use SELECT *.)
You can still reference source columns with table-qualification, like a.age.
I reduced to a single age column and added source. You may or may not want that.
Either way, "age" is subject to bitrot, almost always the wrong choice for a table column, and should typically be replaced by "birthday" or similar.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Erwin Brandstetter |
