How to implement different partitioning in Spark?
There is a dataframe with columns date and error. The data should be partitioned by date (like yyyy-MM-dd), but all rows with an error should be written to a different location. Is there a way to add custom partition resolving which generates the path from the date with one algorithm and a different one for errors? Ideally, the folder structure should look like this:
my_table/data/date=2022-01-01/
my_table/error/
Any ideas?
P.S. Why a custom partitioner? Yes, I can add a filter and then write twice, but this leads to reading the data twice. Partition "hacking" (in theory) allows reading once and writing once.
Solution 1:[1]
If possible, you can cache the dataframe first and then filter and write twice, so you avoid reading the data twice.
Otherwise, without writing a new partitioner, you can add a new column using a CASE WHEN expression which maps each valid date to itself and each error to a NULL value, and then use the standard partitionBy method in Spark.
The NULL values will be saved under a partition called __HIVE_DEFAULT_PARTITION__.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Guy |
