'How to combine duplicated rows into a single row?
I have a table with each row representing a person. In this table there are a lot of duplicates that I want to get rid of. I want to deduplicate based on name and age only. However, the information in columns can be spread between different rows for the same employee. For example:
| name | age | height | eye_color | weight |
|---|---|---|---|---|
| John | 32 | null | green | null |
| John | 32 | null | null | 75 |
| John | 32 | 180 | null | null |
| John | 32 | null | null | 74 |
In this example, the expected output would be:
| name | age | height | eye_color | weight |
|---|---|---|---|---|
| John | 32 | 180 | green | 75 |
Note that it doesn't matter if the weight is 75 or 74, the order is not important for my use case, I just want to fill as much null as possible, while removing duplicates.
There is a unicity constraint on some columns, so simply updating all rows with the desired values and then keeping one row per group is not an option unfortunately, ie updating the table to look like this:
| name | age | height | eye_color | weight |
|---|---|---|---|---|
| John | 32 | 180 | green | 75 |
| John | 32 | 180 | green | 75 |
| John | 32 | 180 | green | 75 |
| John | 32 | 180 | green | 75 |
before deduplicating is not possible.
If age or name is null for an employee, it shouldn't be deduplicated at all.
A similar question on stackoverflow was Remove duplicates with less null values but they only keep the row with the least amount of null so it's not really solving my problem.
However, maybe there is something to do with some aggregation as shown in PostgreSQL: get first non null value per group but I couldn't get anything to work for the moment.
Any idea?
Thanks.
Solution 1:[1]
Simple group by combined with max as an aggregation function should do the trick
SELECT
name,
age,
max(height),
max(eye_color),
max(weight)
FROM
employees
GROUP BY
name, age
WHERE
name is not null and age is not null;
Here we always get biggest value so any other than null should be present if available.
For deduplicate data in your table, one way could be insert desired result into a temporary table, delete old data from employee table and insert data back from temporary table:
create table temporary_employee (
name varchar,
age integer,
height integer,
eye_color varchar,
weight integer
);
insert into temporary_employee (name, age, height, eye_color, weight)
(
SELECT
name,
age,
max(height),
max(eye_color),
max(weight)
FROM
employees
GROUP BY
name, age
WHERE
name is not null and age is not null
);
DELETE FROM employees;
INSERT INTO employees (name, age, height, eye_color, weight) (
SELECT name, age, height, eye_color, weight FROM temporary_employee);
DROP table temporary_employee;
In some database engines there is insert command with has an argument to overwrite all data, but I didn't find such paramter in PostgreSQL.
Another option could be add a temporary column into your table, insert wanted data to the table and delete old data.
ALTER TABLE employee ADD COLUMN newdata bool;
--Insert wanted data to the employee table and mark it as newdata
INSERT INTO employees (name, age, height, eye_color, weight, newdata)
(
SELECT
name,
age,
max(height),
max(eye_color),
max(weight),
't' as newdata
FROM
employees
GROUP BY
name, age, newdata
WHERE
name is not null and age is not null
);
-- Delete old data from the table
DELETE FROM employees WHERE newdata != 't';
---Remove temporary column
ALTER TABLE employees DROP COLUMN newdata;
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
