Spark Scala - Split DataFrame column into multiple depending on the size of the column

I need to split a column into several columns, depending on the number of fields each record has. For example, given the following DF:

+---+-------------------------------------------------+---+
|...|unique_code                                      |...|
+---+-------------------------------------------------+---+
|...|2022-12-31 00:00:00.000000000*_*AAAAA*_*000000000|...|
+---+-------------------------------------------------+---+
|...|2022-12-31 00:00:00.000000000*_*BBBB             |...|
+---+-------------------------------------------------+---+
|...|2022-12-31 00:00:00.000000000*_*CCC*_*1111*_*XX  |...|
+---+-------------------------------------------------+---+

I know that each value has at least 1 and at most 4 fields, always in the same order, which is the order of this list:

val uniqueCodeFields = List("col1", "col2", "col3", "col4")

Therefore the resulting DF would be the following:

+---+-----------------------------+-----+---------+----+---+
|...|col1                         |col2 |col3     |col4|...|
+---+-----------------------------+-----+---------+----+---+
|...|2022-12-31 00:00:00.000000000|AAAAA|000000000|NULL|...|
+---+-----------------------------+-----+---------+----+---+
|...|2022-12-31 00:00:00.000000000|BBBB |NULL     |NULL|...|
+---+-----------------------------+-----+---------+----+---+
|...|2022-12-31 00:00:00.000000000|CCC  |1111     |XX  |...|
+---+-----------------------------+-----+---------+----+---+

I wrote the following, based on https://stackoverflow.com/a/45972636/9025222:

chgPivotedDF.withColumn("temp", split(col("unique_code"), "\\*_\\*")).select(
    (0 until size(col("temp"))).map(i => col("temp").getItem(i).as(uniqueCodeFields(i))): _*
)

However, I am unable to get the length of the "temp" column as an Int, so I cannot loop only up to the number of fields each row actually has. I get the following error:

 error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Int
 (0 until col($"temp")).map(i => col("temp").getItem(i).as(uniqueCodeFields(i))): _*
             ^

Any help is welcome, thanks!
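
For reference, the type mismatch arises because size(...) returns a Column (an expression evaluated per row at execution time), while `until` needs a plain Int on the driver. One possible way around it, sketched below, is to loop over the fixed list of target names instead of the per-row size, since the output schema has to be the same for every row anyway. This is only a sketch under some assumptions: the local SparkSession, the object name SplitUniqueCode and the three-row stand-in for chgPivotedDF are made up for illustration, and it relies on getItem returning null for an out-of-range index, which holds when ANSI mode (spark.sql.ansi.enabled) is off.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Hypothetical wrapper object, for illustration only.
object SplitUniqueCode {
  def main(args: Array[String]): Unit = {
    // Local session for the sketch; ANSI mode is switched off explicitly so that
    // an out-of-range array index yields null instead of raising an error.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("split-unique-code")
      .config("spark.sql.ansi.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    val uniqueCodeFields = List("col1", "col2", "col3", "col4")

    // Stand-in for chgPivotedDF, built from the sample rows in the question.
    val chgPivotedDF = Seq(
      "2022-12-31 00:00:00.000000000*_*AAAAA*_*000000000",
      "2022-12-31 00:00:00.000000000*_*BBBB",
      "2022-12-31 00:00:00.000000000*_*CCC*_*1111*_*XX"
    ).toDF("unique_code")

    // Array column holding the pieces between the "*_*" separators.
    val parts = split(col("unique_code"), "\\*_\\*")

    // Iterate over the known field names (a driver-side list), not over size(col(...)).
    // Rows with fewer pieces get null in the trailing columns.
    val splitCols = uniqueCodeFields.zipWithIndex.map {
      case (name, i) => parts.getItem(i).as(name)
    }

    // Keep the other columns, append the new ones, and drop the original.
    val result = chgPivotedDF
      .select(col("*") +: splitCols: _*)
      .drop("unique_code")

    result.show(false)
    spark.stop()
  }
}
```

The design choice here is that a DataFrame cannot have a different number of columns per row, so always emitting all four columns and letting the missing fields come back as null matches the expected output table above.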



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
