Keeping nested columns in a PySpark DataFrame
Suppose we have a dataframe df with the following schema:
|-- a: decimal(10,2) (nullable = true)
|-- b: decimal(10,2) (nullable = true)
|-- c: decimal(10,2) (nullable = true)
|-- d: decimal(10,2) (nullable = true)
|-- array_a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a1: string (nullable = true)
| | |-- a2: integer (nullable = true)
| | |-- a3: string (nullable = true)
|-- array_b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b1: string (nullable = true)
| | |-- b2: integer (nullable = true)
| | |-- b3: string (nullable = true)
|-- e: decimal(10,2) (nullable = true)
|-- f: decimal(10,2) (nullable = true)
|-- g: decimal(10,2) (nullable = true)
How do you keep the elements of array_a when exploding out array_b? The following attempt doesn't keep array_a, because the second explode selects from the original df (which lacks the flattened a1/a2/a3 columns) rather than from df_a:
from pyspark.sql.functions import explode_outer
cols = df.columns
df_a = df.select(*cols, explode_outer("array_a")).select(*cols, "col.*")
cols_a = df_a.columns
cols.extend(cols_a)
df_b = df.select(*cols, explode_outer("array_b")).select(*cols, "col.*")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
