Delete duplicates from a Snowflake table without creating a new table
I have a Snowflake table 'TEST' as shown below.
create OR REPLACE table TEST (
HASH VARCHAR,
COL1 VARCHAR,
COL2 VARCHAR,
COL3 VARCHAR
);
INSERT INTO TEST values ('cpdfer0', 'abc', 'xyz', 'I');
INSERT INTO TEST values ('sdfghj7', 'bnm', 'xyz', 'J');
INSERT INTO TEST values ('cpdfer0', 'abc', 'xyz', 'K');
INSERT INTO TEST values ('acvfgt9', 'xyz', 'bnm', 'K');
INSERT INTO TEST values ('zp3fert', 'cbc', 'lyz', 'M');
INSERT INTO TEST values ('acvfgt9', 'xyz', 'bnm', 'N');
I want to delete duplicate rows from the TEST table based on the column 'HASH' (without creating a new table). I tried the query below, but it does not work. What is wrong with my query?
DELETE FROM TEST WHERE HASH NOT IN (
    SELECT *
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY HASH ORDER BY HASH) AS rbn
        FROM TEST
    )
    WHERE rbn = 1);
Expected rows after dropping duplicates from TEST
Solution 1:[1]
Here is something that would work:
DELETE FROM test
WHERE hash||col1||col2||col3 IN (
    SELECT hash||col1||col2||col3
    FROM (
        SELECT hash, col1, col2, col3,
               row_number() OVER (PARTITION BY hash ORDER BY hash, col1, col2, col3) AS rbn
        FROM test
    )
    WHERE rbn > 1
);
The reason I'm using hash||col1||col2||col3 instead of just hash is that I need to identify the exact row that I want to delete. If I used only hash to identify the rows to delete, it would delete all rows with that hash. Note also that this code only works correctly when rows sharing the same hash differ in at least one other column (as in your sample data, where COL3 differs). If there were any exactly duplicate rows, it would delete all of them.
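One caveat with plain concatenation is that different values can collide (for example, 'ab'||'c' equals 'a'||'bc') and any NULL column makes the whole expression NULL. A minimal sketch of the same delete with a separator and NULL handling, assuming the separator character '~' never occurs in the data:
-- '~' is an assumed separator; it must not appear in any column value
DELETE FROM test
WHERE COALESCE(hash,'')||'~'||COALESCE(col1,'')||'~'||COALESCE(col2,'')||'~'||COALESCE(col3,'') IN (
    SELECT COALESCE(hash,'')||'~'||COALESCE(col1,'')||'~'||COALESCE(col2,'')||'~'||COALESCE(col3,'')
    FROM (
        SELECT hash, col1, col2, col3,
               row_number() OVER (PARTITION BY hash ORDER BY hash, col1, col2, col3) AS rbn
        FROM test
    )
    WHERE rbn > 1
);
Exactly duplicate rows would still all be removed by this pattern, so the id-column approach in Solution 2 remains the safer choice in that case.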
Solution 2:[2]
You could try something like this:
https://community.snowflake.com/s/question/0D50Z00008EJgemSAD/how-to-delete-duplicate-records-
create OR REPLACE table TEST (
HASH1 VARCHAR,
COL1 VARCHAR,
COL2 VARCHAR,
COL3 VARCHAR
);
INSERT INTO TEST values ('cpdfer0', 'abc', 'xyz', 'I');
INSERT INTO TEST values ('sdfghj7', 'bnm', 'xyz', 'J');
INSERT INTO TEST values ('cpdfer0', 'abc', 'xyz', 'K');
INSERT INTO TEST values ('acvfgt9', 'xyz', 'bnm', 'K');
INSERT INTO TEST values ('zp3fert', 'cbc', 'lyz', 'M');
INSERT INTO TEST values ('acvfgt9', 'xyz', 'bnm', 'N');
select * from test order by hash1;
alter table test add id varchar(100);
update test set id = UUID_STRING();
delete from test
where id in (
    select id
    from (
        select t.*,
               row_number() over (partition by hash1 order by hash1) as rbn
        from test t
    )
    where rbn > 1
);
alter table test drop id;
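A quick sanity check that no HASH1 value is still duplicated might look like this:
-- expect zero rows if the de-duplication worked
select hash1, count(*) as cnt
from test
group by hash1
having count(*) > 1;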
Alternatively, before dropping the id column, the simpler pattern from the linked thread also works: keep the smallest id per HASH1 group and delete the rest.
DELETE FROM test
WHERE id NOT IN
(
    SELECT MIN(id)
    FROM test
    GROUP BY hash1
);
Solution 3:[3]
This is an interesting problem, and I have tried a somewhat twisty work-around, given that Snowflake does not have any built-in pseudo-columns in tables, such as rowid (as offered by other databases).
Idea used - Merge:
Construct an inline view/subquery that contains only the duplicate rows, and only two records for each duplicate group.
Then modify this set: leave one record intact (the original) and append a random string to the main duplicate column of the second record, making it distinct from the original. Use this inline view as the source of a MERGE against the table: matched rows (the existing duplicates) are deleted, and the slightly modified record is inserted back. While re-inserting, the appended string is stripped off so the row becomes the original record again.
I tested this on a limited data set and am not sure how it would behave on a large one. Also, this removes duplicates by singling out the second row of each duplicate set; look into adjusting the rn conditions as needed. The main idea was to remove duplicates without relying on a pseudo-column.
Post back if anyone has suggestions or finds that it does not work.
Current state of the table:
select * from test_hash;
+---------+------+------+------+
| HASH | COL1 | COL2 | COL3 |
|---------+------+------+------|
| cpdfer0 | abc | xyz | I |
| sdfghj7 | bnm | xyz | J |
| cpdfer0 | abc | xyz | K |
| acvfgt9 | xyz | bnm | K |
| zp3fert | cbc | lyz | M |
| acvfgt9 | xyz | bnm | N |
+---------+------+------+------+
Main query used to remove duplicates
merge into TEST_HASH using
(
    -- source set: for each HASH value that occurs more than once, take two rows;
    -- the rn = 1 row keeps the original HASH, the rn = 2 row gets '|randomstr' appended
    select case when rn = 1 then HASH else HASH || '|randomstr' end AS HASH,
           col1, col2, col3
    from (
        select *,
               row_number() over (partition by HASH order by HASH) AS rn
        from (
            select *,
                   count(HASH) over (partition by HASH order by HASH) AS cnt
            from TEST_HASH
            qualify cnt > 1
        )
        qualify rn < 3
    )
) d
on TEST_HASH.HASH = d.HASH
-- existing rows with a duplicated HASH match the unmodified source row and are deleted;
-- the modified source row matches nothing and is re-inserted with the suffix stripped off
when matched then delete
when not matched then insert (HASH, col1, col2, col3)
values (substr(d.HASH, 1, position('|', d.HASH) - 1), d.col1, d.col2, d.col3);
State of the table after removing the duplicates:
select * from test_hash;
+---------+------+------+------+
| HASH | COL1 | COL2 | COL3 |
|---------+------+------+------|
| acvfgt9 | xyz | bnm | K |
| cpdfer0 | abc | xyz | I |
| sdfghj7 | bnm | xyz | J |
| zp3fert | cbc | lyz | M |
+---------+------+------+------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Maja F. |
| Solution 2 | Himanshu Kandpal |
| Solution 3 | Pankaj |

