'How to combine duplicated rows into a single row?

I have a table with each row representing a person. In this table there are a lot of duplicates that I want to get rid of. I want to deduplicate based on name and age only. However, the information in columns can be spread between different rows for the same employee. For example:

name age height eye_color weight
John 32 null green null
John 32 null null 75
John 32 180 null null
John 32 null null 74

In this example, the expected output would be:

name age height eye_color weight
John 32 180 green 75

Note that it doesn't matter if the weight is 75 or 74, the order is not important for my use case, I just want to fill as much null as possible, while removing duplicates.

There is a unicity constraint on some columns, so simply updating all rows with the desired values and then keeping one row per group is not an option unfortunately, ie updating the table to look like this:

name age height eye_color weight
John 32 180 green 75
John 32 180 green 75
John 32 180 green 75
John 32 180 green 75

before deduplicating is not possible.

If age or name is null for an employee, it shouldn't be deduplicated at all.

A similar question on stackoverflow was Remove duplicates with less null values but they only keep the row with the least amount of null so it's not really solving my problem.

However, maybe there is something to do with some aggregation as shown in PostgreSQL: get first non null value per group but I couldn't get anything to work for the moment.

Any idea?

Thanks.



Solution 1:[1]

Simple group by combined with max as an aggregation function should do the trick

SELECT
 name,
 age,
 max(height),
 max(eye_color),
 max(weight)
FROM
 employees
GROUP BY
 name, age
WHERE
 name is not null and age is not null;

Here we always get biggest value so any other than null should be present if available.

For deduplicate data in your table, one way could be insert desired result into a temporary table, delete old data from employee table and insert data back from temporary table:

create table temporary_employee (
 name varchar,
 age integer,
 height integer,
 eye_color varchar,
 weight integer
);

insert into temporary_employee (name, age, height, eye_color, weight) 
(
SELECT
 name,
 age,
 max(height),
 max(eye_color),
 max(weight)
FROM
 employees
GROUP BY
 name, age
WHERE
 name is not null and age is not null
);

DELETE FROM employees;

INSERT INTO employees (name, age, height, eye_color, weight) (
SELECT name, age, height, eye_color, weight FROM temporary_employee);

DROP table temporary_employee;

In some database engines there is insert command with has an argument to overwrite all data, but I didn't find such paramter in PostgreSQL.

Another option could be add a temporary column into your table, insert wanted data to the table and delete old data.

ALTER TABLE employee ADD COLUMN newdata bool;

--Insert wanted data to the employee table and mark it as newdata
INSERT INTO employees (name, age, height, eye_color, weight, newdata) 
(
SELECT
 name,
 age,
 max(height),
 max(eye_color),
 max(weight),
 't' as newdata
FROM
 employees
GROUP BY
 name, age, newdata
WHERE
 name is not null and age is not null
);

-- Delete old data from the table
DELETE FROM employees WHERE newdata != 't';

---Remove temporary column
ALTER TABLE employees DROP COLUMN newdata;

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1