Keeping nested columns in a PySpark DataFrame
Suppose we have a dataframe df with the following schema:
|-- a: decimal(10,2) (nullable = true)
|-- b: decimal(10,2) (nullable = true)
|-- c: decimal(10,2) (nullable = true)
|-- d: decimal(10,2) (nullable = true)
|-- array_a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a1: string (nullable = true)
| | |-- a2: integer (nullable = true)
| | |-- a3: string (nullable = true)
|-- array_b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b1: string (nullable = true)
| | |-- b2: integer (nullable = true)
| | |-- b3: string (nullable = true)
|-- e: decimal(10,2) (nullable = true)
|-- f: decimal(10,2) (nullable = true)
|-- g: decimal(10,2) (nullable = true)
How do you keep the elements of array_a when exploding out array_b? The following attempt doesn't keep array_a, because the second explode selects from the original df (which lacks the flattened a1/a2/a3 columns) rather than from df_a:
from pyspark.sql.functions import explode_outer
cols = df.columns
df_a = df.select(*cols, explode_outer("array_a")).select(*cols, "col.*")
cols_a = df_a.columns
cols.extend(cols_a)
df_b = df.select(*cols, explode_outer("array_b")).select(*cols, "col.*")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
