'Spark - Strict validate Date against format
I want to validate dates in a file against the format specified by user, if date are not exactly matching the format I need to set a False flag.
I am using Spark = spark-3.1.1
sample code:
spark = SparkSession.builder.getOrCreate()
schema = StructType([ \
StructField("date",StringType(),True), \
StructField("active", StringType(), True)
])
input_data = [
("Saturday November 2012 10:45:42.720+0100",'Y'),
("Friday April 2022 10:45:42.720-0800",'Y'),
("Friday April 20225 10:45:42.720-0800",'Y'),
("Friday April 202 10:45:42.720-0800",'Y'),
("Friday April 20 10:45:42.720-0800",'Y'),
("Friday April 1 10:45:42.720-0800",'Y'),
("Friday April 0 10:45:42.720-0800",'Y'),
]
date_format = "EEEE MMMM yyyy HH:mm:ss.SSSZ"
temp_df = spark.createDataFrame(data=input_data,schema=schema)
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = temp_df.select('*',
f.when(f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).isNotNull(), True).otherwise(False).alias('Date_validation'),
f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).alias('converted_date'),
)
df.show(truncate=False)
Where year are not strictly validated.
when tried with
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
I am faced with exception
py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString. : org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEEE MMMM yyyy HH:mm:ss.SSSZ' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Caused by: java.lang.IllegalArgumentException: Illegal pattern character: E
as per documnetaion :Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.
is there anyway I could use best of both version ? to handle this scenario
expectation:
- EEEE -> should be able to handle days in week
- YYYY -> strict 4 year
- YY -> strict 2 year
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|

