Date from week date format: 2022-W02-1 (ISO 8601) [duplicate]
Given a date, I create a column in the ISO 8601 week date format:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2019-03-18',), ('2019-12-30',), ('2022-01-03',), ('2022-01-10',)], ['date_col'])
df = df.withColumn(
    'iso_from_date',
    F.concat_ws(
        '-',
        F.expr('extract(yearofweek from date_col)'),
        F.lpad(F.weekofyear('date_col'), 3, 'W0'),
        F.expr('weekday(date_col) + 1')
    )
)
df.show()
# +----------+-------------+
# | date_col|iso_from_date|
# +----------+-------------+
# |2019-03-18| 2019-W12-1|
# |2019-12-30| 2020-W01-1|
# |2022-01-03| 2022-W01-1|
# |2022-01-10| 2022-W02-1|
# +----------+-------------+
Using Spark 3, how can I get the date back from the ISO 8601 week date?
I tried the following, but it is both incorrect and relies on the LEGACY time parser configuration, which I would rather avoid.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn('date_from_iso', F.to_date('iso_from_date', "YYYY-'W'ww-uu")).show()
# +----------+-------------+-------------+
# | date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18| 2019-W12-1| 2019-03-18|
# |2019-12-30| 2020-W01-1| 2019-12-30|
# |2022-01-03| 2022-W01-1| 2021-12-27|
# |2022-01-10| 2022-W02-1| 2022-01-03|
# +----------+-------------+-------------+
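For reference, the LEGACY setting is needed in the first place because Spark 3's default (CORRECTED) parser rejects week-based pattern letters such as Y, w and u. A minimal sketch of the failure (the exact exception type and message vary across 3.x versions):
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
try:
    df.withColumn('date_from_iso', F.to_date('iso_from_date', "YYYY-'W'ww-uu")).show()
except Exception as e:
    # Spark reports the week-based pattern as unsupported
    # (exact exception class varies by version).
    print(type(e).__name__)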
I am aware that I could create a udf, which works:
import datetime
@F.udf('date')
def iso_to_date(iso_date):
    return datetime.datetime.strptime(iso_date, '%G-W%V-%u')
df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
But I am looking for a more efficient option. The ideal solution should not use the LEGACY configuration and should be translatable to SQL or Scala (no inefficient udf).
Solution 1:[1]
In PySpark, I have found a nicer option than a plain udf. It uses pandas_udf, which is vectorized (more efficient):
import pandas as pd
@F.pandas_udf('date')
def iso_to_date(iso_date: pd.Series) -> pd.Series:
    return pd.to_datetime(iso_date, format='%G-W%V-%u')
df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
# +----------+-------------+-------------+
# | date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18| 2019-W12-1| 2019-03-18|
# |2019-12-30| 2020-W01-1| 2019-12-30|
# |2022-01-03| 2022-W01-1| 2022-01-03|
# |2022-01-10| 2022-W02-1| 2022-01-10|
# +----------+-------------+-------------+
It works in Spark 3 without the LEGACY configuration, so it is acceptable. (Note that pandas_udf requires PyArrow to be available on the cluster.)
However, there is room for improvement, as this option is not transferable to SQL or Scala.
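One way to close that gap is a sketch using only built-in functions (assuming Spark 3.0+ for make_date and the fixed 'YYYY-Www-D' layout produced above; the helper name iso_week_date_to_date is illustrative). It relies on the fact that January 4 always falls in ISO week 1, so the date can be rebuilt arithmetically from the parsed year, week and weekday:
from pyspark.sql import functions as F

def iso_week_date_to_date(col_name):
    # Illustrative helper: rebuild a date from a 'YYYY-Www-D' string.
    # January 4 is always in ISO week 1; weekday() returns 0 for Monday,
    # so date_add(jan4, -weekday(jan4)) is the Monday of week 1.
    jan4 = f"make_date(cast(substring({col_name}, 1, 4) as int), 1, 4)"
    return F.expr(f"""
        date_add(
            date_add({jan4}, -weekday({jan4})),
            (cast(substring({col_name}, 7, 2) as int) - 1) * 7
            + cast(substring({col_name}, 10, 1) as int) - 1
        )
    """)

df.withColumn('date_from_iso', iso_week_date_to_date('iso_from_date')).show()
On the sample data above this should reproduce date_col exactly, with no udf and no LEGACY setting, and the expression translates directly to SQL or Scala.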
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ZygD |