'Find rows with given values if no result, return highest value possible

I have a Spark dataframe. I need to get the average property price for a given region at a given month. If no data is found for that month, then I need to get the latest data month if it exists.

My data looks like this

+-----+-------------+-----------+----------+-------------+--------+---------+
|month|property_type|postal_code|avg_price | city        |dpt_code|city_code|
+-----+-------------+-----------+----------+-------------+--------+---------|
|  11 |House        |XXXXX      |2834      |FOO          |123     |1        |
|  11 |House        |XXXXY      |870       |NEAR_FOO     |123     |2        |
|   2 |House        |YYYYY      |732       |LA           |100     |5        |
|   3 |House        |YYYYX      |2361      |NEAR_LA      |100     |6        |
|   11|House        |ZZZZZ      |2162      |ATL          |105     |9        |

so if let's say I chose dpt_code = 123 and month = 11 I get:

+-----+-------------+-----------+----------+-------------+--------+---------+
|month|property_type|postal_code|avg_price | city        |dpt_code|city_code|
+-----+-------------+-----------+----------+-------------+--------+---------|
|  11 |House        |XXXXX      |2834      |FOO          |123     |1        |
|  11 |House        |XXXXY      |870       |NEAR_FOO     |123     |2        |

This is the easy case. Now what I don't know how to achieve is:

Let say dpt_code = 100 and month = 11 I want this returned:

+-----+-------------+-----------+----------+-------------+--------+---------+
|month|property_type|postal_code|avg_price | city        |dpt_code|city_code|
|   2 |House        |YYYYY      |732       |LA           |100     |5        |
|   3 |House        |YYYYX      |2361      |NEAR_LA      |100     |6        |

For now I think this code works for the first part:

def get_mean_prices_by_dept_and_month(df, dept, month):
    df.filter((df["dept_code"] == dept) & (df["month"] == month)) 
    # if month condition isn't satisfied then do same request but 
    # with df["month"] = latest month possible for that dept_code, else null.


Solution 1:[1]

You will have to apply a count of the dept_code and month filtered dataframe to know it will contain data. Then depending on this, apply an action to find the maximum month for a dept_code and return data matching the maximum month.

from pyspark.sql import functions as F

data = [(11, "House", "XXXXX", "2834", "FOO", 123, 1,),
 (11, "House", "XXXXY", "870", "NEAR_FOO", 123, 2,),
 (2, "House", "YYYYY", "732", "LA", 100, 5,),
 (3, "House", "YYYYX", "2361", "NEAR_LA", 100, 6,),
 (11, "House", "ZZZZZ", "2162", "ATL", 105, 9,), ]

df = spark.createDataFrame(data, ("month", "property_type", "postal_code", "avg_price", "city", "dept_code", "city_code",))

def get_mean_prices_by_dept_and_month(df, dept, month):
    department_data = df.filter((df["dept_code"] == dept))
    given_month_df = department_data.where(department_data["month"] == month)
    if given_month_df.count() > 0:
        return given_month
    latest_month = department_data.select(F.max("month").alias("latest_month")).head()["latest_month"]
    if latest_month is None:
        None
    return department_data.where(department_data["month"] == latest_month)

get_mean_prices_by_dept_and_month(df, 100, 11).show()

"""
+-----+-------------+-----------+---------+-------+---------+---------+
|month|property_type|postal_code|avg_price|   city|dept_code|city_code|
+-----+-------------+-----------+---------+-------+---------+---------+
|    3|        House|      YYYYX|     2361|NEAR_LA|      100|        6|
+-----+-------------+-----------+---------+-------+---------+---------+
"""

Solution 2:[2]

In your last case example shouldn't it return only month 3? Because this is the latest data month that exists. You could do this:

import pandas as pd

df = pd.DataFrame({
    'month': [11, 11, 2, 3, 11],
    'avg_price': [2834, 870, 732, 2361, 2162],
    'dept_code': [123, 123, 100, 100, 105]
})

def get_mean_prices_by_dept_and_month(df, dept, month):
    df2 = df[(df["dept_code"] == dept) & (df["month"] == month)]
    if df2.empty:
        # Here you can change 'month' to '12' if you want iterate backwards starting from last month of the year
        for i in range(month, 0, -1):
            df2 = df[(df["dept_code"] == dept) & (df["month"] == i)]
            if not df2.empty:
                break
    return df2
    
df_filter = get_mean_prices_by_dept_and_month(df, 100, 11)
print(df_filter)

Result:

   month  avg_price  dept_code
3      3       2361        100

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nithish
Solution 2