Add current timestamp to Spark dataframe but partition it by the current date without adding it to the dataframe
I understand we can add the current timestamp to a dataframe like this:
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
However, would it be possible to partition by the current date, derived from that timestamp, at the point of saving the dataframe as a Parquet file, without adding the date as a column? What I am trying to achieve would be something like this:
df.write.partitionBy(date("time_stamp")).parquet("/path/to/file")
Solution 1:[1]
You can't do that. partitionBy takes the name of one or more existing columns of the dataset, not a computed expression. In addition, when reading the data back, Spark performs partition discovery based on the directory structure of the stored files.
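For illustration, partition discovery relies on the partition column and its value being encoded in the directory names of the output, along the lines of the following layout (the path, column name, and date are hypothetical):

```
/path/to/file/current_date=2022-05-01/part-00000-<uuid>.snappy.parquet
/path/to/file/current_date=2022-05-02/part-00000-<uuid>.snappy.parquet
```

When Spark reads /path/to/file, it reconstructs current_date as a column from these directory names.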
Solution 2:[2]
As explained in Solution 1, partitionBy takes column names, so you cannot supply a computed expression.
You can, however, add a column using current_date and partition by it; the current_date column will not end up inside your Parquet data files anyway, since Spark encodes the partition value in the directory names.
import org.apache.spark.sql.functions.current_date
df.withColumn("current_date", current_date())
  .write.partitionBy("current_date")
  .parquet("/path/to/file")
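A minimal sketch of reading the output back, assuming an active SparkSession named spark; partition discovery restores current_date as a column of the dataframe even though it is not stored inside the Parquet files themselves:

```scala
// Read the partitioned output; Spark infers a current_date column
// from directories named current_date=YYYY-MM-DD under this path.
val restored = spark.read.parquet("/path/to/file")
restored.printSchema() // schema includes the discovered current_date column
```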
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | 过过招 |
| Solution 2 | Vaebhav |
