'Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a csv with this data in column form, I know that I can't directly write an array-type object in a csv, I used the explode function to remove the fields I need , being able to leave them in a columnar form, but when writing the data frame in csv, I'm getting an error when using the explode function, from what I understand it's not possible to do this with two variables in the same select, can someone help me with something alternative?

from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[1]")
    .appName("sample")
    .getOrCreate())

df = (spark.read.option("multiline", "true")
    .json("data/origin/crops.json"))

df2 = (explode('history').alias('history'), explode('trial').alias('trial'))
.select('history.started_at', 'history.finished_at', col('id'), trial.is_trial, trial.ws10_max))

(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))

root
 |-- history: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- finished_at: string (nullable = true)
 |    |    |-- started_at: string (nullable = true)
 |-- id: long (nullable = true)
 |-- trial: struct (nullable = true)
 |    |-- is_trial: boolean (nullable = true)
 |    |-- ws10_max: double (nullable = true)

I'm trying to return something like this

started_at finished_at is_trial ws10_max
First row row
Second row row

Thank you!



Solution 1:[1]

Use explode on array and select("struct.*") on struct.

df.select("trial", "id", explode('history').alias('history')),
  .select('id', 'history.*', 'trial.*'))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Gohmz