PySpark: generate field names from nested JSON
Suppose I have the following JSON:
```json
{
  "filename": "orderDetails",
  "datasets": [
    {
      "orderId": "ord1001",
      "customerId": "cust5001",
      "orderDate": "2021-12-24 00.00.00.000",
      "shipmentDetails": {
        "street": "M.G.Road",
        "city": "Delhi",
        "state": "New Delhi",
        "postalCode": "110040",
        "country": "India"
      },
      "orderDetails": [
        {
          "productId": "prd9001",
          "quantity": 2,
          "sequence": 1,
          "totalPrice": {
            "gross": 550,
            "net": 500,
            "tax": 50
          }
        },
        {
          "productId": "prd9002",
          "quantity": 3,
          "sequence": 2,
          "totalPrice": {
            "gross": 300,
            "net": 240,
            "tax": 60
          }
        }
      ]
    }
  ]
}
```
I would like to read the JSON into a Spark DataFrame with column names like
filename, filename_datasets, filename_datasets_orderId, ...,
filename_orderDetails_productId, filename_orderDetails_quantity, ...
What is a good way to do that? Can I first generate my custom field names from the JSON schema itself?
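For illustration, generating those names by recursively walking the DataFrame's inferred schema might look like the sketch below (the `flatten_names` helper is hypothetical, just to show the underscore-joined naming pattern I am after):

```python
from pyspark.sql.types import ArrayType, StructType

def flatten_names(schema, prefix=""):
    """Recursively walk a Spark schema, yielding underscore-joined field paths."""
    for field in schema.fields:
        name = f"{prefix}_{field.name}" if prefix else field.name
        dtype = field.dataType
        # Step into arrays of structs so the element fields are listed as well
        if isinstance(dtype, ArrayType) and isinstance(dtype.elementType, StructType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            yield name
            yield from flatten_names(dtype, name)
        else:
            yield name

# After df = spark.read.option("multiLine", True).json("orderDetails.json"),
# list(flatten_names(df.schema)) yields names such as
# 'datasets_orderId', 'datasets_shipmentDetails_city', 'filename', ...
```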
Solution 1:[1]
First of all, there is no straightforward way to do this. The guidelines for what you are requesting are as follows (a sketch of the steps appears after the list):
1. Create a struct with all of your relevant fields.
2. Cast your `datasets` field to the relevant schema, using the struct defined in point 1.
3. Explode it using the `explode` method from `pyspark.sql.functions`.
4. Explode the relevant nested field, `orderDetails`, in the same way.
5. Select the requested columns with a regular select statement.
6. Done.
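A minimal sketch of those steps, assuming the JSON above is saved as orderDetails.json (the file name, the `ds`/`od` aliases, and the selected columns are illustrative; schema inference with `multiLine` stands in for the hand-built struct of points 1 and 2):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# multiLine is needed because the JSON document spans several lines
df = spark.read.option("multiLine", True).json("orderDetails.json")

# Point 3: one row per element of the datasets array
ds = df.select("filename", explode("datasets").alias("ds"))

# Point 4: one row per element of the nested orderDetails array
od = ds.select("filename", "ds", explode(col("ds.orderDetails")).alias("od"))

# Point 5: select and rename the flattened columns
flat = od.select(
    col("filename"),
    col("ds.orderId").alias("filename_datasets_orderId"),
    col("ds.customerId").alias("filename_datasets_customerId"),
    col("ds.shipmentDetails.city").alias("filename_datasets_shipmentDetails_city"),
    col("od.productId").alias("filename_datasets_orderDetails_productId"),
    col("od.quantity").alias("filename_datasets_orderDetails_quantity"),
    col("od.totalPrice.net").alias("filename_datasets_orderDetails_totalPrice_net"),
)
flat.show(truncate=False)
```

Each alias spells out the full nested path, which reproduces the `filename_datasets_...` naming pattern requested in the question.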
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Benny Elgazar |
