Delta Table optimize/vacuum
I have files being written by a Kubernetes job (running on-prem) into an ADLS Gen2 container in the form of a Delta table. (Spark on Kubernetes is what lets me write Delta tables to ADLS.)
A huge number of files (small + big) flows in every hour, and we want to optimize/vacuum the Delta table.
Is there an automatic way, or a setting, with which we can auto-optimize and vacuum the Delta table?
I've read this article on auto optimization, but it's still unclear whether it can help me.
Thank you, Rahul Kishore
Solution 1:[1]
The linked article describes a feature of Delta on Databricks that tries to produce bigger files when writing data - this is different from automatic execution of OPTIMIZE/VACUUM.
Even on Databricks, you need to run VACUUM explicitly - just create a small Spark job that executes VACUUM on the selected table(s), following the documentation for the correct syntax & settings.
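A minimal sketch of such a standalone job, using the `delta-spark` package's `DeltaTable.vacuum` API. The table path is a placeholder for your ADLS Gen2 location, and the 7-day retention is Delta's default, not something from the question:

```python
def retention_hours(days: int) -> int:
    """VACUUM takes its retention period in hours, not days."""
    return days * 24

def vacuum_tables(spark, table_paths, days: int = 7) -> None:
    """Run VACUUM on each Delta table path with the given retention.

    Retentions shorter than 7 days require disabling Delta's
    retention-duration safety check first.
    """
    from delta.tables import DeltaTable  # from the delta-spark package
    for path in table_paths:
        DeltaTable.forPath(spark, path).vacuum(retention_hours(days))

if __name__ == "__main__":
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("delta-vacuum").getOrCreate()
    # Placeholder abfss:// path -- substitute your own container/table
    vacuum_tables(
        spark,
        ["abfss://container@account.dfs.core.windows.net/tables/events"],
    )
    spark.stop()
```

You can schedule this as a CronJob on the same Kubernetes cluster that writes the table, so cleanup runs on a fixed cadence without Databricks.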
Please note that, as of right now, OPTIMIZE is available only on Databricks. If you're using OSS Delta, you can emulate it by reading all or part of the data, repartitioning it for an optimal file size, and writing it back in overwrite mode. (Be careful when you optimize only part of the data - use the replaceWhere option as shown in the documentation.)
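The read/repartition/overwrite emulation above can be sketched like this. The `compact_partition` helper, the predicate, and the 128 MB target file size are all illustrative assumptions; only the `replaceWhere` overwrite pattern comes from the answer:

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # ~128 MB target files (a common choice)

def target_file_count(total_bytes: int,
                      target_bytes: int = TARGET_FILE_BYTES) -> int:
    """How many output files to repartition into; always at least one."""
    return max(1, math.ceil(total_bytes / target_bytes))

def compact_partition(spark, table_path: str, predicate: str,
                      total_bytes: int) -> None:
    """Emulate OPTIMIZE on one slice of an OSS Delta table.

    Reads only the rows matching `predicate`, repartitions them into
    fewer, bigger files, then overwrites just that slice via replaceWhere
    so the rest of the table is untouched.
    """
    df = spark.read.format("delta").load(table_path).where(predicate)
    (df.repartition(target_file_count(total_bytes))
       .write.format("delta")
       .mode("overwrite")
       .option("replaceWhere", predicate)  # restrict the overwrite to the slice
       .save(table_path))
```

Running `compact_partition(spark, path, "event_date = '2022-02-01'", size)` after each hourly batch, followed by the VACUUM job, approximates what OPTIMIZE does on Databricks. Omitting `replaceWhere` while using overwrite mode would wipe the whole table, which is why the predicate must match the data being rewritten.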
Update, Feb 2022: OPTIMIZE for OSS Delta Lake is on the roadmap for the first half of the year.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
