'Interpreting dataframe out of Array JSON in Spark Structured Streaming

I am having a column of below json string in my dataframe how can I explode/flatten it to get single level dataframe ? Currently the schema is

df
|-json_data (StringType)

how can I have a df of below schema ?

df
|-key1
|-key2_signal
|-key2_value

[{
    "key1": 1647336730000,
    "key2": [
      {
        "signal": "signal_key_1",
        "value": 73.6
    },
      {
        "signal": "signal_key_2",
        "value": 3.375
    },
      {
        "signal": "signal_key_3",
        "value": 13.82
    }]
}]


Solution 1:[1]

I guess it's how you should do it.

Explanation:

  1. Creating the dummy for practice on your sample data. (df1 creation)
  2. Take the string column and cast it to be an actual JSON.
  3. Select all of the fields.
  4. Explode key2 since it's a list.
  5. select all the relevant keys from within key2
  6. select key1 and key2 related data. using asterisk (*)
import pyspark.sql.functions as f
from pyspark.sql.types import *
import json
df1 = spark.createDataFrame(
    [{'json_data': [json.dumps({
        "key1": 1647336730000,
        "key2": [
            {
                "signal": "signal_key_1",
                "value": 73.6
            },
            {
                "signal": "signal_key_2",
                "value": 3.375
            },
            {
                "signal": "signal_key_3",
                "value": 13.82
            }]
    })]}],
    schema=StructType([StructField('json_data', StringType(), True)])
)
(
    df1
    .withColumn('actual', f.from_json(f.col('json_data'), f.schema_of_json(f.lit(df1.select(f.col("json_data")).first()[0]))))
    .withColumn('data', f.explode('actual'))
    .drop('actual')
    .withColumn('key2', f.explode('data.key2'))
    .drop('json_data')
    .select('data.key1', 'key2.*').show()
    )
# +-------------+------------+-----+
# |         key1|      signal|value|
# +-------------+------------+-----+
# |1647336730000|signal_key_1| 73.6|
# |1647336730000|signal_key_2|3.375|
# |1647336730000|signal_key_3|13.82|
# +-------------+------------+-----+

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1