Get first product ordered by customer

I have an existing table like the one below. I want to replace the NULLs in the first_product column with the first product each customer ordered.

INPUT:

customer_id  product     order_date_id  first_product
C0001        apple       20220224       NULL
C0001        pear        20220101       NULL
C0002        strawberry  20220224       NULL
C0001        apple       20220206       NULL

OUTPUT:

customer_id  product     order_date_id  first_product
C0001        apple       20220224       pear
C0001        pear        20220101       pear
C0002        strawberry  20220224       strawberry
C0001        apple       20220206       pear

I have thought about using row numbers, but I am not sure how to pull it all together. This is the code I have so far, and I am not sure how to use it to update the first_product column:

ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date_id) AS first_occurrence

Some pseudo-code:

REPLACE first_product FROM table WITH product WHERE
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date_id) = 1
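The pseudo-code above can be made concrete in plain SQL: window functions cannot appear directly in a WHERE clause, so the row number has to be computed in a CTE and joined back via a correlated UPDATE. A small sketch of that idea using SQLite from Python, with the question's data (the table name orders is assumed):

```python
import sqlite3

# In-memory database loaded with the question's input rows.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (
    customer_id TEXT, product TEXT, order_date_id INT, first_product TEXT)""")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, NULL)",
    [("C0001", "apple",      20220224),
     ("C0001", "pear",       20220101),
     ("C0002", "strawberry", 20220224),
     ("C0001", "apple",      20220206)])

# Rank each customer's orders by date in a CTE, then copy the
# rank-1 product into first_product with a correlated UPDATE.
con.execute("""
    WITH ranked AS (
        SELECT customer_id, product,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY order_date_id) AS rn
        FROM orders)
    UPDATE orders
    SET first_product = (SELECT r.product FROM ranked r
                         WHERE r.customer_id = orders.customer_id
                           AND r.rn = 1)
""")

# All C0001 rows now show 'pear'; the C0002 row shows 'strawberry'.
for row in con.execute(
        "SELECT * FROM orders ORDER BY customer_id, order_date_id"):
    print(row)
```
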


Solution 1:[1]

You can use the first window function to achieve this:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._  // required for toDF and the $"col" syntax

val cust_data = Seq[(String, String, Int, String)](
                  ("C0001", "apple",      20220224, null),
                  ("C0001", "pear",       20220101, null),
                  ("C0002", "strawberry", 20220224, null),
                  ("C0001", "apple",      20220206, null)
                ).toDF("cust_id", "product", "date_id", "first_prod")

// With an ordered window, first() returns the product from the
// earliest date_id row in each customer's partition.
val out_df = cust_data.withColumn("first_prod",
  first($"product").over(Window.partitionBy($"cust_id").orderBy($"date_id")))


out_df.show()

+-------+----------+--------+----------+
|cust_id|   product| date_id|first_prod|
+-------+----------+--------+----------+
|  C0001|      pear|20220101|      pear|
|  C0001|     apple|20220206|      pear|
|  C0001|     apple|20220224|      pear|
|  C0002|strawberry|20220224|strawberry|
+-------+----------+--------+----------+
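Why this works: once the window has an orderBy, its default frame runs from the start of the partition to the current row, so first() always sees the partition's earliest date_id row, and every row of a customer gets the same product. A plain-Python sketch of that semantics (just the logic, not the Spark API):

```python
# Model of first(product) over Window.partitionBy(cust_id).orderBy(date_id):
# every row receives the product from that customer's earliest date_id.
rows = [("C0001", "apple",      20220224),
        ("C0001", "pear",       20220101),
        ("C0002", "strawberry", 20220224),
        ("C0001", "apple",      20220206)]

earliest = {}  # cust_id -> (date_id, product) with the smallest date_id
for cust, prod, date in rows:
    if cust not in earliest or date < earliest[cust][0]:
        earliest[cust] = (date, prod)

# Attach the computed first product to every row.
result = [(cust, prod, date, earliest[cust][1]) for cust, prod, date in rows]
for row in result:
    print(row)
```
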

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Pradeep yadav