Transpose a group of repeating columns in a large horizontal dataframe into a new vertical dataframe using Scala or PySpark in Databricks
Although this question may seem to have been answered before, it has not: the existing transpose answers all relate to pivoting the data held in a single column. I want to make a vertical table from a horizontal set of repeating columns. Take this example:
| MyPrimaryKey | Insurer_Factor_1_Name | Insurer_Factor_1_Code | Insurer_Factor_1_Value | Insurer_Factor_2_Name | Insurer_Factor_2_Code | Insurer_Factor_2_Value | Insurer_Factor_[n]_Name | Insurer_Factor_[n]_Code | Insurer_Factor_[n]_Value |
|---|---|---|---|---|---|---|---|---|---|
| XX-ABCDEF-1234-ABCDEF123 | Special | SP1 | 2500 | Awesome | AW2 | 3500 | ecetera | etc | 999999 |
where [n] is any number of repetitions,
transforming it into a new vertical dataframe:
| MyPrimaryKey | Insurer_Factor_ID | Insurer_Factor_Name | Insurer_Factor_Code | Insurer_Factor_Value |
|---|---|---|---|---|
| XX-ABCDEF-1234-ABCDEF123 | 1 | Special | SP1 | 2500 |
| XX-ABCDEF-1234-ABCDEF123 | 2 | Awesome | AW2 | 3500 |
| XX-ABCDEF-1234-ABCDEF123 | [n] | ecetera | etc | 999999 |
There is also the possibility that the “Code” column may be missing, so that we only receive the name and value; in that case a null must be inserted into the code column.
I've searched high and low for this, but there just doesn't seem to be anything out there. Note also that the first example could contain many rows.
Solution 1:[1]
The reason you haven't found it is that there is no magic trick for moving an 'interestingly' designed table into a well-designed table. You are going to have to hand-code a query that either unions the rows into your table (see the sketch below) or selects arrays that you then explode.
Sure, you could probably write some code to generate the SQL that you want, but there really isn't a feature that magically translates this wide format into a row-based format.
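A minimal sketch of the hand-coded union approach, using the column names from the question; the dataframe name df and the group count n are assumptions you would adjust to your data:
import org.apache.spark.sql.functions.{col, lit}
val n = 3 // assumption: how many repeating groups the file actually has
val vertical = (1 to n).map { i =>
  df.select(
    col("MyPrimaryKey"),
    lit(i).alias("Insurer_Factor_ID"),
    col(s"Insurer_Factor_${i}_Name").alias("Insurer_Factor_Name"),
    // if the feed arrives without _Code columns, put lit(null).cast("string") here
    col(s"Insurer_Factor_${i}_Code").alias("Insurer_Factor_Code"),
    col(s"Insurer_Factor_${i}_Value").alias("Insurer_Factor_Value")
  )
}.reduce(_ union _)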
Solution 2:[2]
In order of preference:
- Revisit the decision to send everything as one wide file: it sounds like it would save a lot of work if you were simply sent multiple files.
- Change the column schema: put a delimiter every 4th column so the row boundaries are visible; the file can then be read in as rows by splitting on that delimiter.
- Write your own custom datasource: you can use the existing text datasource as an example of how to write one that could interpret every 3 columns as a row (a flatMap sketch of that idea follows this list).
- Write a custom UDF that takes all columns as a parameter and returns an array of rows, which you then call explode on to turn them into rows. This will be slow, so I give it to you as the final option.
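A rough sketch of the "interpret every 3 columns as a row" idea, done with a plain flatMap rather than a full custom datasource; the input path factors.csv and the comma-separated, key-first layout are assumptions:
import spark.implicits._
val rows = spark.read.textFile("factors.csv") // hypothetical input path
  .flatMap { line =>
    val fields = line.split(",", -1) // -1 keeps empty trailing fields
    val key = fields.head            // MyPrimaryKey comes first
    fields.tail.grouped(3).zipWithIndex.collect {
      // groups that are not exactly (name, code, value) are dropped here
      case (Array(name, code, value), i) => (key, i + 1, name, code, value)
    }
  }
  .toDF("MyPrimaryKey", "Insurer_Factor_ID", "Insurer_Factor_Name",
    "Insurer_Factor_Code", "Insurer_Factor_Value")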
Solution 3:[3]
*** WARNING *** This is going to use up a lot of memory. With 6000 rows it will be slow and may run out of memory. If it works, great, but I suggest you code your own datasource, as that is likely a better/faster strategy.
If you want to do this with a UDF, and you are only doing this with a couple of rows, you can do it like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
/* spark.sql("select * from info").show();
+----+-------+----+
|type|db_type|info|
+----+-------+----+
| bot| x_bot| x|
| bot| x_bnt| x|
| per| xper| b|
+----+-------+----+ */
val schema = ArrayType(new StructType().add("name","string").add("info","string"))
// Note: on Spark 3 this untyped udf(f, schema) form requires
// spark.sql.legacy.allowUntypedScalaUDF=true.
val myUDF = udf((s: Row) => {
  // turn one wide row into an array of narrow (name, info) rows
  Seq( Row( s.get(0).toString, s.get(1).toString ), Row( s.get(2).toString, s.get(2).toString ) )
}, schema)
val records = spark.sql("select * from info")
// pack every column into a struct so the UDF sees the whole row
val arrayRecords = records.select( myUDF(struct(records.columns.map(records(_)) : _*)).alias("Arrays") )
// one output row per array element
arrayRecords.select( explode(arrayRecords("Arrays")).alias("myCol") )
  .select( col("myCol.*") ).show()
+----+-----+
|name| info|
+----+-----+
| bot|x_bot|
| x| x|
| bot|x_bnt|
| x| x|
| per| xper|
| b| b|
+----+-----+
Pseudocode:
- Create a schema for the rows.
- Create the UDF with that schema (here I only show a small manipulation, but you can obviously use more complicated logic in your case).
- Select the data.
- Apply the UDF.
- Explode the array.
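Applied to the question's layout, the same pattern might look like the sketch below; the source table name insurer_wide is hypothetical, and it assumes the key column is followed by repeating groups of three string columns (if the _Code columns are missing, step by 2 and emit a null in the code slot instead):
val factorSchema = ArrayType(new StructType()
  .add("Insurer_Factor_ID","integer")
  .add("Insurer_Factor_Name","string")
  .add("Insurer_Factor_Code","string")
  .add("Insurer_Factor_Value","string"))
val meltUDF = udf((r: Row) => {
  // columns after the key come in groups of three: name, code, value
  (1 until r.length by 3).zipWithIndex.map { case (i, id) =>
    Row(id + 1, r.getString(i), r.getString(i + 1), r.getString(i + 2))
  }
}, factorSchema)
val wide = spark.table("insurer_wide") // hypothetical source table
wide.select( col("MyPrimaryKey"),
    explode(meltUDF(struct(wide.columns.map(wide(_)) : _*))).alias("f") )
  .select( col("MyPrimaryKey"), col("f.*") )
  .show()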
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Matt Andruff |
| Solution 2 | Matt Andruff |
| Solution 3 | |
