In a Spark Scala DataFrame, how do I get the week end date based on a week number?
As per my business logic, the week start day is Monday and the week end day is Sunday.
I want to get the week end date (Sunday) based on the week number. Some years have 53 weeks, and it is not working for the 53rd week alone.
The expected value for dsupp_trans_dt is 2021-01-03,
but as per the code below it is null:
scala> case class Data(id:Int,weekNumber:String)
defined class Data
scala> var stgDF = Seq(Data(100,"53/2020")).toDF()
stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string]
scala> val weekNumber = "53/2020"
weekNumber: String = 53/2020
scala> val monthIdNum = "202001"
monthIdNum: String = 202001
scala> val isLeapYearFunc = (year: Int) => (((year % 4) == 0) && !(
| ((year % 100) == 0) &&
| !((year % 400) == 0))
| )
isLeapYearFunc: Int => Boolean = <function1>
scala> val isLeapYear = isLeapYearFunc(monthIdNum.substring(0,4).toInt)
isLeapYear: Boolean = true
scala> val kafkaFilePeriod = "2053"
kafkaFilePeriod: String = 2053
scala> stgDF = stgDF.withColumn("year_week_number",lit(weekNumber)).withColumn("is_leap_year",lit(isLeapYear)).withColumn("dsupp_trans_dt",
| when (col("is_leap_year") === true ,date_add(to_date(col("year_week_number"), "w/yyyy"),7)).otherwise(date_add(to_date(col("year_week_number"), "w/yyyy"),14)))
stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string ... 3 more fields]
scala> stgDF.show(10,false)
+---+----------+----------------+------------+--------------+
|id |weekNumber|year_week_number|is_leap_year|dsupp_trans_dt|
+---+----------+----------------+------------+--------------+
|100|53/2020 |53/2020 |true |null |
+---+----------+----------------+------------+--------------+
The same code works fine for week 52 below:
scala> val weekNumber = "52/2020"
weekNumber: String = 52/2020
scala> stgDF = stgDF.withColumn("year_week_number",lit(weekNumber)).withColumn("is_leap_year",lit(isLeapYear)).withColumn("dsupp_trans_dt",
| when (col("is_leap_year") === true ,date_add(to_date(col("year_week_number"), "w/yyyy"),7)).otherwise(date_add(to_date(col("year_week_number"), "w/yyyy"),14)))
stgDF: org.apache.spark.sql.DataFrame = [id: int, weekNumber: string ... 3 more fields]
scala> stgDF.show
+---+----------+----------------+------------+--------------+
| id|weekNumber|year_week_number|is_leap_year|dsupp_trans_dt|
+---+----------+----------------+------------+--------------+
|100| 53/2020| 52/2020| true| 2020-12-27|
+---+----------+----------------+------------+--------------+
Solution 1:[1]
You can use a user-defined function based on the new Java time API.
First, you need to create a function that converts a String representing a week, such as 53/2020, to the date of the Sunday of that week:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val toWeekDate = (weekNumber: String) => {
  LocalDate.parse("7/" + weekNumber, DateTimeFormatter.ofPattern("e/w/YYYY"))
}
Where, for the elements of the date pattern (see the DateTimeFormatter documentation for more details):
- e is the day of week (1 for Monday to 7 for Sunday)
- w is the week of year
- YYYY is the week year: for instance, 01/01/2021 is in week year 2020 because it belongs to the 53rd week of 2020.
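You can check the pattern quickly in a plain Scala REPL. A small sketch follows; note that e, w and Y are locale-dependent fields, so it explicitly pins a Monday-first locale (Locale.UK here, which matches the ISO week definition) rather than relying on the JVM default:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// Pin a Monday-first locale so e maps 1 -> Monday ... 7 -> Sunday
val isoLikeFmt = DateTimeFormatter.ofPattern("e/w/YYYY", Locale.UK)

// Sunday of week 53 of week year 2020:
LocalDate.parse("7/53/2020", isoLikeFmt)  // 2021-01-03

// Sunday of week 1 of week year 2021:
LocalDate.parse("7/1/2021", isoLikeFmt)   // 2021-01-10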
Then you convert it to a user-defined function and register it with your Spark session:
import org.apache.spark.sql.functions.udf
val to_week_date = udf(toWeekDate)
spark.udf.register("to_week_date", to_week_date)
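Registering the UDF also makes it callable from Spark SQL under the registered name (a small sketch, assuming Spark 3.x, where java.time.LocalDate is supported as a UDF result type):
// The registered name can be used directly in SQL expressions
spark.sql("SELECT to_week_date('53/2020') AS dsupp_trans_dt").show()
// +--------------+
// |dsupp_trans_dt|
// +--------------+
// |    2021-01-03|
// +--------------+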
Finally, you can use your user-defined function when creating the new column:
import org.apache.spark.sql.functions.{col, lit}
val weekNumber = "53/2020"
stgDF
.withColumn("year_week_number",lit(weekNumber))
.withColumn("dsupp_trans_dt", to_week_date(col("year_week_number")))
Which gives you the following result:
+---+----------+----------------+--------------+
|id |weekNumber|year_week_number|dsupp_trans_dt|
+---+----------+----------------+--------------+
|100|53/2020 |53/2020 |2021-01-03 |
+---+----------+----------------+--------------+
Solution 2:[2]
The documentation for the weekofyear Spark function has the answer:
Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.
It means that every year actually has 52 weeks plus n days, where n < 7.
For that reason, to_date considers 53/2020 an incorrect date and returns null. For the same reason, to_date considers 01/2020 an invalid date, because 01/2020 actually falls in the 53rd week of the 2019 week year.
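You can see this boundary behaviour with weekofyear itself (a minimal sketch, run in the same spark-shell session): the first days of January 2021 still report week 53 of the 2020 week year, and week 1 of 2021 only starts on Monday 2021-01-04.
import org.apache.spark.sql.functions.{col, to_date, weekofyear}

Seq("2020-12-28", "2021-01-01", "2021-01-03", "2021-01-04")
  .toDF("d")
  .select(col("d"), weekofyear(to_date(col("d"))).as("week_number"))
  .show()
// +----------+-----------+
// |         d|week_number|
// +----------+-----------+
// |2020-12-28|         53|
// |2021-01-01|         53|
// |2021-01-03|         53|
// |2021-01-04|          1|
// +----------+-----------+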
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Bondarenko |
