Duplicate rows instead of overwriting existing rows when using the COPY command to load data from Amazon S3 into Redshift

copy Agent1
from 's3://my-bucket/Reports/Historical Metrics Report (1).csv' 
iam_role 'arn:aws:iam::my-role:role/RedshiftRoleForS3'
csv
null as '\000'
IGNOREHEADER 1;

I am using the command above to pull data from S3 into a Redshift table. It works fine, but there is one problem: the first time the data is copied, the rows are inserted into the table as expected, but when the file in the S3 bucket is updated and I run the same command again, it appends a whole new set of rows instead of overwriting the rows that were already created.

How do I stop the duplication? I just want the existing rows to be overwritten and replaced with the new data when I run the COPY command after the S3 file has been updated.

[Screenshot: the same rows are being appended again with the updated data]



Solution 1:[1]

Redshift doesn't enforce uniqueness; primary key and unique constraints are informational only. See - https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
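
A quick illustration that duplicates are accepted even with a declared key (the table and column names here are made up for the example):

-- Constraints are informational only; Redshift does not reject duplicates.
CREATE TABLE demo (id INT PRIMARY KEY, val VARCHAR(20));
INSERT INTO demo VALUES (1, 'first');
INSERT INTO demo VALUES (1, 'second');  -- accepted: both rows now exist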

To update a table you will need to COPY the new data into a "staging" table and then perform an UPSERT (delete-and-insert) against the target, as sketched below. See - https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html
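
A minimal sketch of that pattern, assuming Agent1 has a key column named agent_id that identifies which rows should be replaced (adjust the names to match your schema):

BEGIN;

-- 1. Load the S3 file into an empty staging table with the same layout.
CREATE TEMP TABLE agent1_staging (LIKE Agent1);

COPY agent1_staging
FROM 's3://my-bucket/Reports/Historical Metrics Report (1).csv'
iam_role 'arn:aws:iam::my-role:role/RedshiftRoleForS3'
csv
null as '\000'
IGNOREHEADER 1;

-- 2. Remove target rows that are about to be replaced.
DELETE FROM Agent1
USING agent1_staging
WHERE Agent1.agent_id = agent1_staging.agent_id;

-- 3. Insert the fresh rows from the staging table.
INSERT INTO Agent1
SELECT * FROM agent1_staging;

COMMIT;

Because the DELETE and INSERT run inside one transaction, readers never see a half-updated table; the temporary staging table is dropped automatically at the end of the session.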

Solution 2:[2]

If your goal is to empty the table before loading data via COPY, you can use the TRUNCATE command:

TRUNCATE tablename;

Note that TRUNCATE commits immediately and cannot be rolled back within a transaction.

Using TRUNCATE is much more efficient than DELETE FROM tablename because it immediately removes all storage associated with the table.
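
Put together, a full refresh of the table from the question would look like this (reusing the asker's COPY options):

-- Full refresh: empty the table, then reload the entire file.
TRUNCATE Agent1;

COPY Agent1
FROM 's3://my-bucket/Reports/Historical Metrics Report (1).csv'
iam_role 'arn:aws:iam::my-role:role/RedshiftRoleForS3'
csv
null as '\000'
IGNOREHEADER 1;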

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Bill Weiner
[2] Solution 2: John Rotenstein