'Modify date (month) in spark date column based on condition

I would like to modify my date column in spark df to subtract 1 month only if certain months appear. I.e. only if date is yyyy-07-31 or date is yyyy-04-30 change it to yyyy-06-31 and yyyy-03-30 respectively. Any ideas how to do it using pyspark functions?

|DateColumn|    
|2016-04-30|
|2015-04-30|
|2017-09-03|
|2017-07-31|
|2016-09-01|
|2018-07-31|
     ...

Expected outcome:

|DateColumn|    
|2016-03-30| <- changed
|2015-03-30| <- changed
|2017-09-03|
|2017-06-31| <- changed
|2016-09-01|
|2018-06-31| <- changed
     ...


Solution 1:[1]

I would recommend using the functions module and then combining several functionalities:

  • .when() and then otherwise()
  • .month()
  • .date_format()
  • .add_months(date, -1)

For example, it could come down to:

import pyspark.sql.functions as F

df = spark.createDataFrame([{'date': '2022-04-15'}, {'date': '2022-05-17'}])

df \
    .withColumn('new_date', 
                F.when(F.month(F.col('date')).isin([4, 7]), 
                       F.add_months(F.date_format('date', 'yyyy-MM-dd'), -1))
                .otherwise(F.col('date'))) \
    .show()

and you would then get:

+----------+----------+
|      date|  new_date|
+----------+----------+
|2022-04-15|2022-03-15|
|2022-05-17|2022-05-17|
+----------+----------+

Update (turned out it's rather substring problem than generic months subtracting, see comments for detail)

Actually, you can stack .when functionality to do some sort of if-elif-else in Python. Below code shows how to perform it in the scenario you shared:

from pyspark.sql.functions import when

df \
    .withColumn('new_date', 
                 when(F.substring(F.col("date"), -5, 5) == '01-31', F.concat(F.year(F.col("date")), F.lit('-12-31')))
                .when(F.substring(F.col("date"), -5, 5) == '04-30', F.concat(F.year(F.col("date")), F.lit('-03-30')))
                .when(F.substring(F.col("date"), -5, 5) == '07-31', F.concat(F.year(F.col("date")), F.lit('-06-30')))
                .when(F.substring(F.col("date"), -5, 5) == '10-31', F.concat(F.year(F.col("date")), F.lit('-09-30')))
                .otherwise(F.col('date'))) \
    .show()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1