PySpark: generate flattened field names from nested JSON

Suppose I have the following JSON:

{
  "filename": "orderDetails",
  "datasets": [
    {
      "orderId": "ord1001",
      "customerId": "cust5001",
      "orderDate": "2021-12-24 00.00.00.000",
      "shipmentDetails": {
        "street": "M.G.Road",
        "city": "Delhi",
        "state": "New Delhi",
        "postalCode": "110040",
        "country": "India"
      },
      "orderDetails": [
        {
          "productId": "prd9001",
          "quantity": 2,
          "sequence": 1,
          "totalPrice": {
            "gross": 550,
            "net": 500,
            "tax": 50
          }
        },
        {
          "productId": "prd9002",
          "quantity": 3,
          "sequence": 2,
          "totalPrice": {
            "gross": 300,
            "net": 240,
            "tax": 60
          }
        }
      ]
    }
  ]
}

I would like to read the JSON into a Spark DataFrame with column names like

filename, filename_datasets, filename_datasets_orderId, ...

filename_orderDetails_productId, filename_orderDetails_quantity, ...

What is a good way to do that? Can I first generate my custom field names from the JSON schema itself?



Solution 1:[1]

Firstly, there is no straightforward, easy way to do it. The guidelines for what you are requesting are as follows (a sketch appears after the list):

  1. Create a struct schema containing all of your relevant fields.
  2. Cast your datasets field to the schema defined in step 1 (or apply the schema when reading the file).
  3. Explode it using the explode function from pyspark.sql.functions.
  4. Explode the nested orderDetails field the same way.
  5. Select the requested columns with a regular select statement, aliasing each one to its flattened name.
  6. Done.
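A minimal sketch of these steps, assuming Spark 3.x, that the JSON above is saved at /tmp/orderDetails.json, and a schema reduced to the fields named in the question (the path and variable names are illustrative, not part of the original answer):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType
)

spark = SparkSession.builder.getOrCreate()

# Step 1: a struct schema describing the relevant fields.
total_price = StructType([
    StructField("gross", IntegerType()),
    StructField("net", IntegerType()),
    StructField("tax", IntegerType()),
])
order_detail = StructType([
    StructField("productId", StringType()),
    StructField("quantity", IntegerType()),
    StructField("sequence", IntegerType()),
    StructField("totalPrice", total_price),
])
dataset = StructType([
    StructField("orderId", StringType()),
    StructField("customerId", StringType()),
    StructField("orderDate", StringType()),
    StructField("orderDetails", ArrayType(order_detail)),
])
schema = StructType([
    StructField("filename", StringType()),
    StructField("datasets", ArrayType(dataset)),
])

# Step 2: apply the schema while reading; multiLine is needed because
# the JSON object spans several lines.
df = (spark.read.schema(schema)
      .option("multiLine", True)
      .json("/tmp/orderDetails.json"))

# Steps 3 and 4: explode the datasets array, then the nested
# orderDetails array inside each element.
df = df.withColumn("dataset", explode("datasets"))
df = df.withColumn("orderDetail", explode("dataset.orderDetails"))

# Step 5: select the columns under their flattened names.
flat = df.select(
    col("filename"),
    col("dataset.orderId").alias("filename_datasets_orderId"),
    col("dataset.customerId").alias("filename_datasets_customerId"),
    col("orderDetail.productId").alias("filename_datasets_orderDetails_productId"),
    col("orderDetail.quantity").alias("filename_datasets_orderDetails_quantity"),
)
flat.show(truncate=False)

As for generating the names from the schema itself: once the schema object exists, you can walk it recursively, descending into StructType fields and into the element type of ArrayType fields. The helper below is an illustrative addition, not part of the original answer:

from pyspark.sql.types import ArrayType, StructType

def flat_names(schema, prefix=""):
    """Collect flattened leaf-field names, joining nesting levels with '_'."""
    names = []
    for field in schema.fields:
        name = f"{prefix}_{field.name}" if prefix else field.name
        dtype = field.dataType
        # Arrays of structs contribute their element's fields.
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            names.extend(flat_names(dtype, name))
        else:
            names.append(name)
    return names

print(flat_names(schema))
# ['filename', 'datasets_orderId', 'datasets_customerId', ...,
#  'datasets_orderDetails_totalPrice_gross', ...]

Note that the question's examples prefix every name with filename_; since filename is itself a top-level field rather than a parent of datasets, you would have to prepend that prefix separately if you want exactly those names.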

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Benny Elgazar