Date from week date format: 2022-W02-1 (ISO 8601)

Given a date, I create a column in ISO 8601 week date format:

from pyspark.sql import functions as F
df = spark.createDataFrame([('2019-03-18',), ('2019-12-30',), ('2022-01-03',), ('2022-01-10',)], ['date_col'])

df = df.withColumn(
    'iso_from_date',
    F.concat_ws(
        '-',
        F.expr('extract(yearofweek from date_col)'),  # ISO week-based year
        F.lpad(F.weekofyear('date_col'), 3, 'W0'),    # pads to 'W01', 'W12', ...
        F.expr('weekday(date_col) + 1')               # ISO day of week, 1 = Monday
    )
)
df.show()
# +----------+-------------+
# |  date_col|iso_from_date|
# +----------+-------------+
# |2019-03-18|   2019-W12-1|
# |2019-12-30|   2020-W01-1|
# |2022-01-03|   2022-W01-1|
# |2022-01-10|   2022-W02-1|
# +----------+-------------+
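
As a cross-check of the week numbering above, Python's standard library agrees (a quick snippet of mine, outside Spark):

import datetime
# 2019-12-30 belongs to ISO week 1 of ISO year 2020:
datetime.date(2019, 12, 30).isocalendar()
# (2020, 1, 1)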

Using Spark 3, how can I get the date back from the ISO 8601 week date?

I tried the following, but it is both incorrect and relies on the LEGACY time parser policy, which I want to avoid. (It is presumably off because the legacy parser's week numbering is locale-dependent rather than ISO.)

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn('date_from_iso', F.to_date('iso_from_date', "YYYY-'W'ww-uu")).show()
# +----------+-------------+-------------+
# |  date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18|   2019-W12-1|   2019-03-18|
# |2019-12-30|   2020-W01-1|   2019-12-30|
# |2022-01-03|   2022-W01-1|   2021-12-27|
# |2022-01-10|   2022-W02-1|   2022-01-03|
# +----------+-------------+-------------+
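
If you do experiment with the legacy policy, it can be reverted afterwards; a small sketch (the Spark 3 default is EXCEPTION):

# restore the default time parser policy
spark.conf.unset("spark.sql.legacy.timeParserPolicy")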

I am aware that I could create a udf, which works:

import datetime
@F.udf('date')
def iso_to_date(iso_date):
    # %G = ISO year, %V = ISO week, %u = ISO day of week (1 = Monday)
    return datetime.datetime.strptime(iso_date, '%G-W%V-%u').date()

df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
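
For reference, these directives are plain Python (3.6+), so the parsing can be checked outside Spark:

import datetime
datetime.datetime.strptime('2022-W02-1', '%G-W%V-%u').date()
# datetime.date(2022, 1, 10)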

But I am looking for a more efficient option. Ideally it should avoid the LEGACY configuration and be translatable to SQL or Scala (no inefficient udf).



Solution 1 [1]

In PySpark, I have found an option nicer than a plain udf. It uses pandas_udf, which is vectorized (more efficient):

import pandas as pd
@F.pandas_udf('date')
def iso_to_date(iso_date: pd.Series) -> pd.Series:
    # vectorized: parses the whole series in one call
    return pd.to_datetime(iso_date, format='%G-W%V-%u')

df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
# +----------+-------------+-------------+
# |  date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18|   2019-W12-1|   2019-03-18|
# |2019-12-30|   2020-W01-1|   2019-12-30|
# |2022-01-03|   2022-W01-1|   2022-01-03|
# |2022-01-10|   2022-W02-1|   2022-01-10|
# +----------+-------------+-------------+

It works in Spark 3 without the LEGACY configuration, so it's acceptable.

However, there is room for improvement, as this option is not transferable to SQL or Scala.
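
One way to close that gap is to avoid parsing altogether and rebuild the date arithmetically. This is a sketch of my own (not part of the original answer), assuming the fixed 'yyyy-Www-d' layout produced above: ISO week 1 always contains January 4th, so the Monday of week 1 is January 4th minus its weekday offset, and the target date follows by day arithmetic.

from pyspark.sql import functions as F

df2 = (
    df
    # January 4th of the ISO year always falls in ISO week 1
    .withColumn('jan4', F.expr("make_date(int(substring(iso_from_date, 1, 4)), 1, 4)"))
    # Monday of week 1 (Spark's weekday() returns 0 for Monday)
    .withColumn('monday_w1', F.expr('date_sub(jan4, weekday(jan4))'))
    # shift by whole weeks, then by the day-of-week offset
    .withColumn('date_from_iso', F.expr("""
        date_add(
            monday_w1,
            (int(substring(iso_from_date, 7, 2)) - 1) * 7
            + int(substring(iso_from_date, 10, 1)) - 1
        )"""))
    .drop('jan4', 'monday_w1')
)
df2.show()

Since this uses only built-in functions (substring, make_date, weekday, date_sub, date_add), the same expression should translate directly to SQL or Scala. On the four sample rows above it reproduces date_col (e.g. 2022-W02-1 gives back 2022-01-10).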

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: ZygD