PySpark: Extract JSON Objects from Array

I need to extract objects from an array. Where there is more than one object in the array, the row should repeat for that id, and where the field is null I want to append an empty string for that record.

I will explain the method I'm trying to follow.

The dataset can arrive in one of two shapes. Either like this:

{
    "id": {
        "$oid": 1234
    },
    "orderDelivery": {
        "deliveryFee": 3500,
        "deliveryInstructions": null,
        "deliveryIssueIds": null
    }
}

Or like this:

{
    "id": {
        "$oid": 1233
    },
    "orderDelivery": {
        "deliveryFee": 3500,
        "deliveryInstructions": [
            {
                "createdOn": 1635762179390,
                "deliveryMessage": "delivered"
            },
            {
                "createdOn": 1735762179390,
                "deliveryMessage": "not door"
            }
        ],
        "deliveryIssueIds": null
    }
}

If deliveryInstructions is null, I want the table to look like this:

id    createdOn      deliveryMessage
1234

When deliveryInstructions is not null, I want the id to repeat for each object in the array when I create the table:

id    createdOn      deliveryMessage
1233  1635762179390  delivered
1233  1735762179390  not door

I tried implementing it like this, but it didn't work:

from pyspark.sql import functions as F

if df["orderDelivery"]["deliveryInstructions"].isNull():
    df_out = df_out.withColumn("createdOn", F.lit("")).withColumn("deliveryMessage", F.lit(""))
else:
    df2 = df.withColumn("deliveryInstructions", F.explode(df["orderDelivery"]["deliveryInstructions"]))
    df_out = df_out.withColumn("createdOn", df2["deliveryInstructions"]["createdOn"])
    df_out = df_out.withColumn("deliveryMessage", df2["deliveryInstructions"]["deliveryMessage"])

I am reading in df like this:

df = spark.read.option("multiline", "true").json(inputfile)
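For reference, printSchema() shows what Spark infers from the mixed records; with both shapes present, deliveryInstructions comes out as an array of structs (output sketched from the samples above, not verbatim):

df.printSchema()
# root
#  |-- id: struct
#  |    |-- $oid: long
#  |-- orderDelivery: struct
#  |    |-- deliveryFee: long
#  |    |-- deliveryInstructions: array
#  |    |    |-- element: struct
#  |    |    |    |-- createdOn: long
#  |    |    |    |-- deliveryMessage: string
#  |    |-- deliveryIssueIds: string   (an all-null field is inferred as string)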

df_out has been defined like this. There are more fields in the JSON, but I am only interested in adding the deliveryInstructions fields to df_out, both when deliveryInstructions is null and when it is not, as explained above, since I find the instructions the complex part:

from pyspark.sql.types import StringType

df_out = df.select(
    df["id"]["$oid"].cast(StringType()).alias("order_id")
)

Can anyone assist?



Solution 1: [1]

a.json (one JSON object per line):

{"id":{"$oid":1234},"orderDelivery":{"deliveryFee":3500,"deliveryInstructions":null,"deliveryIssueIds":null}}
{"id":{"$oid":1233},"orderDelivery":{"deliveryFee":3500,"deliveryInstructions":[{"createdOn":1635762179390,"deliveryMessage":"delivered"},{"createdOn":1735762179390,"deliveryMessage":"not door"}],"deliveryIssueIds":null}}
Select the nested elements:
from pyspark.sql import functions as F

(df
    .select(
        F.col('id.$oid').alias('id'),
        F.col('orderDelivery.deliveryFee'),
        # I used explode_outer instead of explode to keep the nulls
        F.explode_outer(F.col('orderDelivery.deliveryInstructions')).alias('deliveryInstructions'), 
        F.col('orderDelivery.deliveryIssueIds'),
    )
    .select(
        F.col('id'),
        F.col('deliveryFee'),
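        # flatten the exploded struct into top-level columns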
        F.col('deliveryInstructions.*'),
        F.col('deliveryIssueIds'),
    )
    .show(10, False)
)

+----+-----------+-------------+---------------+----------------+
|id  |deliveryFee|createdOn    |deliveryMessage|deliveryIssueIds|
+----+-----------+-------------+---------------+----------------+
|1234|3500       |null         |null           |null            |
|1233|3500       |1635762179390|delivered      |null            |
|1233|3500       |1735762179390|not door       |null            |
+----+-----------+-------------+---------------+----------------+
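The question asked for empty strings rather than nulls on the rows with no instructions. If that matters, a minimal follow-up sketch (df_flat is a hypothetical name for the flattened result above; na.fill('') only touches string columns, hence the cast):

# turn the nulls from explode_outer into empty strings
result = (df_flat
    .withColumn('createdOn', F.col('createdOn').cast('string'))
    .na.fill('', subset=['createdOn', 'deliveryMessage']))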

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
[1] Solution 1: pltc