How to implement different partitioning in Spark?
There is a dataframe with columns date and error. The data should be partitioned by date (like yyyy-MM-dd), but all rows with an error should be written to a different location. Is there a way to add custom partition resolving which generates the path from the date with one algorithm and a different one for errors? Ideally, the folder structure should look like this:
my_table/data/date=2022-01-01/
my_table/error/
Any ideas?
P.S. Why a custom partitioner? Yes, I can add a filter and then write twice, but this leads to reading the data twice. Partition "hacking" (in theory) allows reading once and writing once.
Solution 1:[1]
If possible, you can cache the dataframe first and then filter and write twice, so you avoid reading the data twice.
Otherwise, without writing a new partitioner, you can add a new column using a CASE WHEN expression which maps each valid date to itself and each error to a NULL value, and then use the standard partitionBy method in Spark.
The NULL values will be saved under a partition called __HIVE_DEFAULT_PARTITION__.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Guy |
