Spark Dataframe Combine 2 Columns into Single Column, with Additional Identifying Column

I'm trying to split and then combine 2 DataFrame columns into 1, with another column identifying which column each value originated from. Here is the code to generate the sample DF:

import spark.implicits._  // for toDF and the $ column syntax (already in scope in spark-shell)

val data = Seq(("1", "in1,in2,in3", null), ("2", "in4,in5", "ex1,ex2,ex3"), ("3", null, "ex4,ex5"), ("4", null, null))

val df = spark.sparkContext.parallelize(data).toDF("id", "include", "exclude")

This is the sample DF

+---+-----------+-----------+
| id|    include|    exclude|
+---+-----------+-----------+
|  1|in1,in2,in3|       null|
|  2|    in4,in5|ex1,ex2,ex3|
|  3|       null|    ex4,ex5|
|  4|       null|       null|
+---+-----------+-----------+

which I'm trying to transform into

+---+----+---+
| id|type|col|
+---+----+---+
|  1|incl|in1|
|  1|incl|in2|
|  1|incl|in3|
|  2|incl|in4|
|  2|incl|in5|
|  2|excl|ex1|
|  2|excl|ex2|
|  2|excl|ex3|
|  3|excl|ex4|
|  3|excl|ex5|
+---+----+---+

EDIT: I should mention that the data inside each of the cells in the example DF is just for illustration; the real values won't necessarily have the in1, ex1, etc. form.

I can get it to work with union, like so (explode drops rows where the split column is null, which is why id 4 disappears):

df.select($"id", lit("incl").as("type"), explode(split(col("include"), ",")))
.union(
    df.select($"id", lit("excl").as("type"), explode(split(col("exclude"), ",")))
)

but I was wondering if this was possible to do without using union.
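
One union-free route that avoids scanning df twice is Spark's stack generator, which emits one labelled row per column before the split/explode. A minimal sketch of that idea (not from the original post):

import org.apache.spark.sql.functions._

df.selectExpr("id", "stack(2, 'incl', include, 'excl', exclude) as (type, vals)")
  .where($"vals".isNotNull)   // stack keeps the null entries, unlike explode
  .select($"id", $"type", explode(split($"vals", ",")).as("col"))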



Solution 1:[1]

The approach that I am thinking of is: combine the include and exclude columns into one, then apply the explode function, keep only the non-null values, and finally use a CASE statement to derive the type.

This might be a long process.

Assuming df is registered as a temporary view (df.createOrReplaceTempView("df")), and using concat_ws so that a null on either side doesn't null out the combined value:

WITH cte AS (
    SELECT id, concat_ws(',', include, exclude) AS outputcol FROM df
),
ctes AS (
    SELECT id, explode(split(outputcol, ',')) AS finalcol FROM cte
)
SELECT id,
       CASE WHEN finalcol LIKE 'in%' THEN 'incl' ELSE 'excl' END AS type,
       finalcol AS col
FROM ctes
WHERE finalcol IS NOT NULL AND finalcol != ''
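
For comparison, here is a minimal DataFrame-API sketch of the same idea (mine, not part of the original answer). One caveat, given the question's edit: the prefix test only classifies rows correctly when values actually start with in/ex, which the real data may not guarantee.

import org.apache.spark.sql.functions._

df.select(
    $"id",
    explode(split(concat_ws(",", $"include", $"exclude"), ",")).as("finalcol")
  )
  .where($"finalcol" =!= "")  // drops the empty row produced when both columns are null
  .select(
    $"id",
    when($"finalcol".startsWith("in"), "incl").otherwise("excl").as("type"),
    $"finalcol".as("col")
  )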

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Srishuk Kumar Bhagwat