Spark ORC output almost triples in size after dropDuplicates
I have a Spark Structured Streaming pipeline that writes data to S3 every 5 minutes, which creates many small files each day. I have logic in place to combine the small ORC files into one large ORC file to tackle the small-files problem. But to my surprise, the size of the ORC output increased almost 2-3x after dropDuplicates. Below is the code snippet.
import org.apache.spark.sql.functions.col

val temp = spark.read.schema(schema).orc(rawPath)
val raw = temp
  .orderBy(columns.map(c => col(c)): _*)
  .dropDuplicates(columns)
I placed the orderBy here to force Spark to do a range partitioning before the dropDuplicates, but this did not help with the size issue.
In one instance, the small files totalled 168 MB, yet after deduplication the output was 478 MB. This makes no sense to me: removing duplicate rows should have reduced the size of the ORC file, right?
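One factor worth checking: ORC's run-length and dictionary encodings compress best when similar values sit next to each other, and the shuffle that dropDuplicates performs can scatter rows that were previously clustered, inflating the compressed size even with fewer rows. As a rough, Spark-free sketch of the same effect (the object name and the generated data are made up for illustration), DEFLATE compresses the identical multiset of strings far better when it is sorted than when it is in shuffled order:

```scala
import java.util.zip.Deflater
import scala.util.Random

object CompressionDemo {
  // Compress a byte array with DEFLATE and return the compressed size.
  def deflatedSize(bytes: Array[Byte]): Int = {
    val d = new Deflater(Deflater.BEST_COMPRESSION)
    d.setInput(bytes)
    d.finish()
    val buf = new Array[Byte](bytes.length * 2 + 64)
    var total = 0
    while (!d.finished()) total += d.deflate(buf)
    d.end()
    total
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Simulate a low-cardinality column: 100k rows, only 50 distinct values,
    // arriving in random (post-shuffle) order.
    val values = Array.fill(100000)(f"user_${rng.nextInt(50)}%05d")

    val shuffledSize = deflatedSize(values.mkString("\n").getBytes)
    val sortedSize   = deflatedSize(values.sorted.mkString("\n").getBytes)

    // The sorted layout groups identical values into long runs,
    // so it compresses to a noticeably smaller output.
    println(s"shuffled order: $shuffledSize bytes, sorted order: $sortedSize bytes")
  }
}
```

If this is the cause, one thing to try (an assumption, not a confirmed fix) is to re-cluster the data after the shuffle, e.g. with `raw.sortWithinPartitions(columns.map(col): _*)` immediately before the write, so each output file regains long runs of similar values.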
Thank you in advance
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow