PySpark: Extract JSON Objects from Array
I need to extract objects from an array. When the array holds more than one object, I need a row for each object, repeating the id; when the field is null, I want an empty string appended for that record instead.
I will explain the method I'm trying to follow.
The dataset can arrive in either of two shapes. It can look like this:
{
  "id": {
    "$oid": 1234
  },
  "orderDelivery": {
    "deliveryFee": 3500,
    "deliveryInstructions": null,
    "deliveryIssueIds": null
  }
}
or like this:
{
  "id": {
    "$oid": 1233
  },
  "orderDelivery": {
    "deliveryFee": 3500,
    "deliveryInstructions": [
      {
        "createdOn": 1635762179390,
        "deliveryMessage": "delivered"
      },
      {
        "createdOn": 1735762179390,
        "deliveryMessage": "not door"
      }
    ],
    "deliveryIssueIds": null
  }
}
If deliveryInstructions is null, then I want the table to look like this:
id | createdOn | deliveryMessage |
---|---|---|
1234 | | |
When deliveryInstructions is not null, then I want it to repeat for that id when I create the table:
id | createdOn | deliveryMessage |
---|---|---|
1233 | 1635762179390 | delivered |
1233 | 1735762179390 | not door |
I tried implementing it like this, but it didn't work:
if df["orderDelivery.deliveryInstructions"].isNull():
    df_out = df_out.withColumn("createdOn", F.lit("")).withColumn("deliveryMessage", F.lit(""))
else:
    df2 = df.select(F.explode("orderDelivery.deliveryInstructions"))
    df_out = df_out.withColumn("createdOn", df2["createdOn"])
    df_out = df_out.withColumn("deliveryMessage", df2["deliveryMessage"])
I am reading in df like this:
df = spark.read.option("multiline", "true").json(inputfile)
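For reference, a quick way to check what schema Spark inferred is printSchema. Assuming both record shapes above appear in the input, I would expect roughly the following (the types are Spark's inference, shown here as my expectation rather than output from the original post):
df.printSchema()
# Roughly expected output:
# root
#  |-- id: struct (nullable = true)
#  |    |-- $oid: long (nullable = true)
#  |-- orderDelivery: struct (nullable = true)
#  |    |-- deliveryFee: long (nullable = true)
#  |    |-- deliveryInstructions: array (nullable = true)
#  |    |    |-- element: struct (containsNull = true)
#  |    |    |    |-- createdOn: long (nullable = true)
#  |    |    |    |-- deliveryMessage: string (nullable = true)
#  |    |-- deliveryIssueIds: string (nullable = true)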
df_out has been defined like this. There are more fields in the JSON, but here I am only interested in adding deliveryInstructions to df_out, both when it is null and when it is not, as explained above, since I find that part complex:
from pyspark.sql.types import StringType

df_out = df.select(
    df["id"]["$oid"].cast(StringType()).alias("order_id")
)
Can anyone assist?
Solution 1:[1]
a.json
{"id":{ "$oid":1234},"orderDelivery":{ "deliveryFee":3500, "deliveryInstructions":null, "deliveryIssueIds":null}}
{"id":{ "$oid":1233},"orderDelivery":{ "deliveryFee":3500, "deliveryInstructions":[ { "createdOn":1635762179390, "deliveryMessage":"delivered" }, { "createdOn":1735762179390, "deliveryMessage":"not door" } ], "deliveryIssueIds":null }}
Select the nested elements:
from pyspark.sql import functions as F
(df
.select(
F.col('id.$oid').alias('id'),
F.col('orderDelivery.deliveryFee'),
# I used explode_outer instead of explode to keep the nulls
F.explode_outer(F.col('orderDelivery.deliveryInstructions')).alias('deliveryInstructions'),
F.col('orderDelivery.deliveryIssueIds'),
)
.select(
F.col('id'),
F.col('deliveryFee'),
F.col('deliveryInstructions.*'),
F.col('deliveryIssueIds'),
)
.show(10, False)
)
+----+-----------+-------------+---------------+----------------+
|id |deliveryFee|createdOn |deliveryMessage|deliveryIssueIds|
+----+-----------+-------------+---------------+----------------+
|1234|3500 |null |null |null |
|1233|3500 |1635762179390|delivered |null |
|1233|3500 |1735762179390|not door |null |
+----+-----------+-------------+---------------+----------------+
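Note that this output keeps nulls, while the question asked for empty strings. If empty strings are really needed, one possible follow-up (a sketch of my own, not part of the original answer; `result` is a hypothetical name for the flattened DataFrame built above) is to cast createdOn to a string and then null-fill both columns:
from pyspark.sql import functions as F

# Sketch, assuming `result` holds the flattened DataFrame from the answer:
# cast the epoch-millis column to string so both columns can be filled with "".
result = result.withColumn("createdOn", F.col("createdOn").cast("string"))
result = result.na.fill("", ["createdOn", "deliveryMessage"])
result.show(10, False)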
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | pltc |