Transpose a group of repeating columns in a large horizontal dataframe into a new vertical dataframe using Scala or PySpark in Databricks
Although this question may seem to have been answered before, it has not: the existing transpose answers all relate to pivoting the data held in a single column. I want to make a vertical table from a horizontal set of repeating columns. Take this example:
| MyPrimaryKey | Insurer_Factor_1_Name | Insurer_Factor_1_Code | Insurer_Factor_1_Value | Insurer_Factor_2_Name | Insurer_Factor_2_Code | Insurer_Factor_2_Value | Insurer_Factor_[n]_Name | Insurer_Factor_[n]_Code | Insurer_Factor_[n]_Value |
|---|---|---|---|---|---|---|---|---|---|
| XX-ABCDEF-1234-ABCDEF123 | Special | SP1 | 2500 | Awesome | AW2 | 3500 | ecetera | etc | 999999 |
where [n] is any number of repetitions,
transforming it into a new vertical dataframe:
| MyPrimaryKey | Insurer_Factor_ID | Insurer_Factor_Name | Insurer_Factor_Code | Insurer_Factor_Value |
|---|---|---|---|---|
| XX-ABCDEF-1234-ABCDEF123 | 1 | Special | SP1 | 2500 |
| XX-ABCDEF-1234-ABCDEF123 | 2 | Awesome | AW2 | 3500 |
| XX-ABCDEF-1234-ABCDEF123 | [n] | ecetera | etc | 999999 |
There is also the possibility that the “Code” column may be missing, so that we only receive the name and value; in that case a null must be inserted into the code column.
I've searched high and low for this, but there just doesn't seem to be anything out there. Note also that the first example could contain many rows.
Solution 1:[1]
The reason you haven't found it is that there is no magic trick for moving an 'interestingly' designed table into a well-designed table. You are going to have to hand-code a query that either unions the rows into your table (see the sketch below) or selects arrays that you then explode.
Sure, you could probably write some code to generate the SQL that you want, but there really isn't a feature that magically translates this wide format into a row-based format.
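A minimal sketch of the hand-coded union approach, using the column names from the question; the dataframe name df and the group count n are assumptions you would adjust to your data:
import org.apache.spark.sql.functions.{col, lit}
val n = 3 // assumption: how many repeating groups the file actually has
val vertical = (1 to n).map { i =>
  df.select(
    col("MyPrimaryKey"),
    lit(i).alias("Insurer_Factor_ID"),
    col(s"Insurer_Factor_${i}_Name").alias("Insurer_Factor_Name"),
    // if the feed arrives without _Code columns, put lit(null).cast("string") here
    col(s"Insurer_Factor_${i}_Code").alias("Insurer_Factor_Code"),
    col(s"Insurer_Factor_${i}_Value").alias("Insurer_Factor_Value")
  )
}.reduce(_ union _)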
Solution 2:[2]
In order of preference:
- Revisit the decision to send everything as one wide file: it sounds like it would save a lot of work if you were simply sent multiple files.
- Change the column schema: put a delimiter every 4th column so the row boundaries are visible; the file can then be read in as rows by splitting on that delimiter.
- Write your own custom datasource: you can use the existing text datasource as an example of how to write one that could interpret every 3 columns as a row (a flatMap sketch of that idea follows this list).
- Write a custom UDF that takes all columns as a parameter and returns an array of rows, which you then call explode on to turn them into rows. This will be slow, so I give it to you as the final option.
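A rough sketch of the "interpret every 3 columns as a row" idea, done with a plain flatMap rather than a full custom datasource; the input path factors.csv and the comma-separated, key-first layout are assumptions:
import spark.implicits._
val rows = spark.read.textFile("factors.csv") // hypothetical input path
  .flatMap { line =>
    val fields = line.split(",", -1) // -1 keeps empty trailing fields
    val key = fields.head            // MyPrimaryKey comes first
    fields.tail.grouped(3).zipWithIndex.collect {
      // groups that are not exactly (name, code, value) are dropped here
      case (Array(name, code, value), i) => (key, i + 1, name, code, value)
    }
  }
  .toDF("MyPrimaryKey", "Insurer_Factor_ID", "Insurer_Factor_Name",
    "Insurer_Factor_Code", "Insurer_Factor_Value")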
Solution 3:[3]
*** WARNING *** This is going to use up a lot of memory. With 6000 rows it will be slow and may run out of memory. If it works, great, but I suggest you code your own datasource, as that is likely a better/faster strategy.
If you want to do this with a UDF, and you are only doing this with a couple of rows, you can do it like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
/* spark.sql("select * from info").show();
+----+-------+----+
|type|db_type|info|
+----+-------+----+
| bot| x_bot| x|
| bot| x_bnt| x|
| per| xper| b|
+----+-------+----+ */
val schema = ArrayType(new StructType().add("name","string").add("info","string"))
// Note: on Spark 3 this untyped udf(f, schema) form requires
// spark.sql.legacy.allowUntypedScalaUDF=true.
val myUDF = udf((s: Row) => {
  // turn one wide row into an array of narrow (name, info) rows
  Seq( Row( s.get(0).toString, s.get(1).toString ), Row( s.get(2).toString, s.get(2).toString ) )
}, schema)
val records = spark.sql("select * from info")
// pack every column into a struct so the UDF sees the whole row
val arrayRecords = records.select( myUDF(struct(records.columns.map(records(_)) : _*)).alias("Arrays") )
// one output row per array element
arrayRecords.select( explode(arrayRecords("Arrays")).alias("myCol") )
  .select( col("myCol.*") ).show()
+----+-----+
|name| info|
+----+-----+
| bot|x_bot|
| x| x|
| bot|x_bnt|
| x| x|
| per| xper|
| b| b|
+----+-----+
Pseudocode:
- Create a schema for the rows.
- Create the UDF with that schema (here I only show a small manipulation, but you can obviously use more complicated logic in your case).
- Select the data.
- Apply the UDF.
- Explode the array.
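Applied to the question's layout, the same pattern might look like the sketch below; the source table name insurer_wide is hypothetical, and it assumes the key column is followed by repeating groups of three string columns (if the _Code columns are missing, step by 2 and emit a null in the code slot instead):
val factorSchema = ArrayType(new StructType()
  .add("Insurer_Factor_ID","integer")
  .add("Insurer_Factor_Name","string")
  .add("Insurer_Factor_Code","string")
  .add("Insurer_Factor_Value","string"))
val meltUDF = udf((r: Row) => {
  // columns after the key come in groups of three: name, code, value
  (1 until r.length by 3).zipWithIndex.map { case (i, id) =>
    Row(id + 1, r.getString(i), r.getString(i + 1), r.getString(i + 2))
  }
}, factorSchema)
val wide = spark.table("insurer_wide") // hypothetical source table
wide.select( col("MyPrimaryKey"),
    explode(meltUDF(struct(wide.columns.map(wide(_)) : _*))).alias("f") )
  .select( col("MyPrimaryKey"), col("f.*") )
  .show()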
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Matt Andruff |
| Solution 2 | Matt Andruff |
| Solution 3 | |
