'Spark - Strict validate Date against format

I want to validate dates in a file against the format specified by user, if date are not exactly matching the format I need to set a False flag.

I am using Spark = spark-3.1.1

sample code:

spark = SparkSession.builder.getOrCreate()
schema = StructType([ \
    StructField("date",StringType(),True), \
    StructField("active", StringType(), True)
    ])

input_data = [
    ("Saturday November 2012 10:45:42.720+0100",'Y'),
    ("Friday April 2022 10:45:42.720-0800",'Y'),
    ("Friday April 20225 10:45:42.720-0800",'Y'),
    ("Friday April 202 10:45:42.720-0800",'Y'),
    ("Friday April 20 10:45:42.720-0800",'Y'),
    ("Friday April 1 10:45:42.720-0800",'Y'),
    ("Friday April 0 10:45:42.720-0800",'Y'),
     ]

date_format = "EEEE MMMM yyyy HH:mm:ss.SSSZ"

temp_df = spark.createDataFrame(data=input_data,schema=schema)

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

df = temp_df.select('*', 
                f.when(f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).isNotNull(), True).otherwise(False).alias('Date_validation'),
                f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).alias('converted_date'),
                )
df.show(truncate=False)

Which give output like: enter image description here

Where year are not strictly validated.

when tried with spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")

I am faced with exception

py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString. : org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEEE MMMM yyyy HH:mm:ss.SSSZ' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

Caused by: java.lang.IllegalArgumentException: Illegal pattern character: E

as per documnetaion :Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.

is there anyway I could use best of both version ? to handle this scenario

expectation:

  • EEEE -> should be able to handle days in week
  • YYYY -> strict 4 year
  • YY -> strict 2 year


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source