How to send data from S3 to Redshift via Lambda without duplicating rows or truncating the table?

I wish to copy data from S3 to Redshift.

However, the COPY command duplicates the rows every time the Lambda function is triggered:

cur.execute('copy table from S3...... ' )

Can someone suggest other ways to do it without truncating existing data?

For commenters: I also tried pushing the dataframe directly to Redshift in append mode.

There is one library, pandas_redshift, but it needs an S3 connection first (which might solve the appending issue).

I also tried cur.execute('truncate ...'), which would keep the table empty, but I don't have delete rights.

cur.execute('select distinct * from ABC.xyz')

cur.execute('copy......')

but the results keep appending.

Can someone please provide code or the right sequence of statements to execute?



Solution 1:[1]

Unfortunately, there is no straightforward COPY option that performs an upsert and handles duplicates for you.

If you don't want to truncate the table, there are two workarounds:

  1. You can create a staging table, COPY the data into it first, and then merge it into the target table. The merge acts as an upsert (a minimal sketch follows this list).

    https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html

  2. You can use a manifest to control which files are copied and which are skipped, so files that were already loaded are not copied again (a second sketch follows). https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#copy-command-examples-manifest
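
Here is a minimal sketch of the staging-table approach, assuming a hypothetical target table abc.xyz with a key column id, placeholder connection details, and an IAM role attached to the cluster for the COPY; the names, S3 path, and file format are assumptions to adapt to your setup.

import psycopg2  # assumes psycopg2 is packaged with the Lambda (e.g. as a layer)

def handler(event, context):
    # Placeholder connection details -- use your own cluster endpoint and credentials.
    conn = psycopg2.connect(
        host='my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com',
        port=5439, dbname='dev', user='lambda_user', password='***'
    )
    try:
        with conn.cursor() as cur:
            # 1. Copy the incoming files into an empty staging table.
            cur.execute("CREATE TEMP TABLE stage (LIKE abc.xyz);")
            cur.execute("""
                COPY stage
                FROM 's3://my-bucket/incoming/'
                IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
                FORMAT AS CSV;
            """)
            # 2. Merge: remove target rows that are being reloaded,
            #    then insert everything from the staging table.
            cur.execute("""
                DELETE FROM abc.xyz
                USING stage
                WHERE abc.xyz.id = stage.id;
            """)
            cur.execute("INSERT INTO abc.xyz SELECT * FROM stage;")
        conn.commit()
    finally:
        conn.close()

If you have no DELETE privilege on the target table, step 2 can instead insert only rows that are not already present (for example, INSERT ... SELECT ... FROM stage WHERE NOT EXISTS ...), so existing rows are never touched.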
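
And a sketch of the manifest approach, again with hypothetical bucket, key, table, and role names: the Lambda writes a manifest listing only the files that have not been loaded yet (new_keys would typically come from the S3 event that triggered it), and COPY then loads exactly those files, so re-triggering the function does not re-copy old files.

import json
import boto3

def copy_new_files(cur, new_keys):
    # Build a manifest that names only the files that still need to be loaded.
    manifest = {
        "entries": [
            {"url": f"s3://my-bucket/{key}", "mandatory": True}
            for key in new_keys
        ]
    }
    boto3.client("s3").put_object(
        Bucket="my-bucket",
        Key="manifests/load.manifest",
        Body=json.dumps(manifest).encode("utf-8"),
    )
    # The MANIFEST keyword makes COPY read the file list from the manifest
    # instead of loading every object under a prefix.
    cur.execute("""
        COPY abc.xyz
        FROM 's3://my-bucket/manifests/load.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV
        MANIFEST;
    """)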

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Nishu Tayal