AWS Glue job outputs S3-to-S3 parquet data
I have an S3-to-S3 Glue Spark job that runs an Athena table or view and lands the result back in an S3 catalog location. The job should write to the configured partitions of the target S3 catalog and overwrite only the partitions it is writing to. My first thought was to convert my Glue DynamicFrame to a Spark DataFrame so I could easily overwrite the S3 location, but I have problems with this approach:
- Junk Spark output files on S3 with dollar signs (can be solved with 's3a', though I still get parquet files written outside the partitions)
- Performance
Dataframe - the view table
S3 Athena table - the landing table for the view table
df.toDF().write.mode("overwrite").format("parquet").partitionBy("column1", "column2", "column3").save(s3_athena_table_location)
Any suggestions to improve this? I also need to remove duplicate records from my view table.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow