AWS Glue job outputs S3-to-S3 parquet data
I have an S3-to-S3 Glue Spark job that runs an Athena table or view and lands the result back in an S3 catalog location. The job should write to the configured partitions of the target S3 catalog and overwrite only the partitions it is writing to. My first thought was to convert my Glue DynamicFrame to a Spark DataFrame so I could easily overwrite the S3 location, but I have problems with this approach:
- Junk Spark output files on S3 with dollar signs (can be solved with 's3a', though I still get parquet files written outside the partitions)
- Performance
Dataframe - the view table
S3 Athena table - the landing table for the view table
df.toDF().write.mode("overwrite").format("parquet").partitionBy("column1", "column2", "column3").save(s3_athena_table_location)
Any suggestions to improve this? I also need to remove duplicate records from my view table.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow