In a Spark Scala DataFrame, how do I get the week end date based on the week number?

As per my business logic, the week start day is Monday and the week end day is Sunday.

I want to get the week end date (which is Sunday) based on the week number. Some years have 53 weeks, and it is not working for the 53rd week alone.

The expected value for dsupp_trans_dt is 2021-01-03,

but as per the code below it is null:

 scala>  case class Data(id:Int,weekNumber:String)
 defined class Data

 scala>  var stgDF = Seq(Data(100,"53/2020")).toDF()
 stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string]

 scala>  val weekNumber = "53/2020"
 weekNumber: String = 53/2020

 scala>  val monthIdNum = "202001"
 monthIdNum: String = 202001

 scala>         val isLeapYearFunc = (year: Int) => (((year % 4) == 0) && !(
     |           ((year % 100) == 0) &&
     |             !((year % 400) == 0))
     |           )
  isLeapYearFunc: Int => Boolean = <function1>

 scala>  val isLeapYear = isLeapYearFunc(monthIdNum.substring(0,4).toInt)
 isLeapYear: Boolean = true

 scala>  val kafkaFilePeriod = "2053"
 kafkaFilePeriod: String = 2053

 scala> stgDF = stgDF.withColumn("year_week_number",lit(weekNumber)).withColumn("is_leap_year",lit(isLeapYear)).withColumn("dsupp_trans_dt",
 | when (col("is_leap_year") === true ,date_add(to_date(col("year_week_number"), "w/yyyy"),7)).otherwise(date_add(to_date(col("year_week_number"), "w/yyyy"),14)))
 stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string ... 3 more fields]

 scala> stgDF.show(10,false)

 +---+----------+----------------+------------+--------------+
 |id |weekNumber|year_week_number|is_leap_year|dsupp_trans_dt|
 +---+----------+----------------+------------+--------------+
 |100|53/2020   |53/2020         |true        |null          |
 +---+----------+----------------+------------+--------------+

The same works fine for the case below:

 scala> val weekNumber = "52/2020"
 weekNumber: String = 52/2020

  scala> stgDF = stgDF.withColumn("year_week_number",lit(weekNumber)).withColumn("is_leap_year",lit(isLeapYear)).withColumn("dsupp_trans_dt",
 | when (col("is_leap_year") === true ,date_add(to_date(col("year_week_number"), "w/yyyy"),7)).otherwise(date_add(to_date(col("year_week_number"), "w/yyyy"),14)))
  stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string ... 3 more fields]

  scala> stgDF.show
  +---+----------+----------------+------------+--------------+
  | id|weekNumber|year_week_number|is_leap_year|dsupp_trans_dt|
  +---+----------+----------------+------------+--------------+
  |100|   53/2020|         52/2020|        true|    2020-12-27|
  +---+----------+----------------+------------+--------------+


Solution 1:[1]

You can use a user-defined function based on the java.time API.

First, you need to create a function that converts a String representing a week, such as 53/2020, to the date of the Sunday of that week:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Parse "day-of-week/week-of-year/week-year", forcing day 7 (Sunday) of the given week
val toWeekDate = (weekNumber: String) => {
  LocalDate.parse("7/" + weekNumber, DateTimeFormatter.ofPattern("e/w/YYYY"))
}

Where, for the elements of the date pattern (see the DateTimeFormatter documentation for more details):

  • e is the localized day of week (1 for Monday through 7 for Sunday in locales where the week starts on Monday)
  • w is the week of the week-based year
  • YYYY is the week-based year: for instance, 01/01/2021 is in week year 2020 because it belongs to the 53rd week of 2020.
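
As a quick sanity check, you can call the function directly in the spark-shell before wrapping it in a UDF (a minimal sketch; it assumes a default locale where the week starts on Monday, and the expected values match the result table shown further down):

// Quick check of the helper before turning it into a UDF
toWeekDate("53/2020")  // 2021-01-03, the Sunday of week 53 of 2020
toWeekDate("52/2020")  // 2020-12-27, the Sunday of week 52 of 2020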

Then you convert it to a user-defined function and register it in your Spark session:

import org.apache.spark.sql.functions.udf

val to_week_date = udf(toWeekDate)
spark.udf.register("to_week_date", to_week_date)
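
Registering it also makes the function callable from Spark SQL, not just from the DataFrame API (a minimal sketch; the view name staging is only an example):

// Optional: use the registered function from SQL ("staging" is a hypothetical view name)
stgDF.createOrReplaceTempView("staging")
spark.sql("SELECT id, to_week_date(weekNumber) AS dsupp_trans_dt FROM staging").show(false)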

Finally, you can use your user-defined function when creating the new column:

import org.apache.spark.sql.functions.{col, lit}

val weekNumber = "53/2020"

stgDF
  .withColumn("year_week_number", lit(weekNumber))
  .withColumn("dsupp_trans_dt", to_week_date(col("year_week_number")))
  .show(false)

Which gives you the following result:

+---+----------+----------------+--------------+
|id |weekNumber|year_week_number|dsupp_trans_dt|
+---+----------+----------------+--------------+
|100|53/2020   |53/2020         |2021-01-03    |
+---+----------+----------------+--------------+
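
If each row already carries its own week string, you can also apply the UDF directly to the existing weekNumber column instead of a literal (a minimal sketch based on the stgDF defined in the question):

// Derive the week end date from the per-row weekNumber column
stgDF
  .withColumn("dsupp_trans_dt", to_week_date(col("weekNumber")))
  .show(false)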

Solution 2:[2]

The documentation for the weekofyear Spark function has the answer:

Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.

This means that every year actually has 52 weeks plus n days, where n < 7. For that reason, to_date considers 53/2020 an invalid date and returns null. For the same reason, to_date considers 01/2020 an invalid date, because 01/2020 actually belongs to the 53rd week of the year 2019.
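
The behaviour is easy to reproduce with to_date alone (a minimal sketch; it assumes the parser behaviour shown in the question, i.e. Spark 2.x or the legacy time parser, since newer Spark versions may reject week-based patterns such as w outright):

import org.apache.spark.sql.functions.{col, to_date}

// Run in the spark-shell, where spark.implicits._ is already in scope for toDF
Seq("52/2020", "53/2020").toDF("year_week_number")
  .withColumn("parsed", to_date(col("year_week_number"), "w/yyyy"))
  .show(false)
// 52/2020 parses to the start of week 52 of 2020 (the question then adds 7 days to reach 2020-12-27)
// 53/2020 yields null, as described above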

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1:
Solution 2: Bondarenko