'How to process Json file with column headers and data separated in different lists in spark
I have Json file as follows:
{"columns": ['Name', 'City', 'DOB'],
"data":[
['ABC', 'Georgia', '01/05/1987'],
['ABC', 'Kansas', '10/11/1989']]}
How can this file be processed properly using pyspark so that it will get loaded into table containing name, city and dob in a proper format. Will I have to first transform it into the usual json format and then proceed ahead with json load or is there any other way to handle this
Solution 1:[1]
I tried with the below approach and it works. Not sure if its the best way, I'm open for suggestions or improvements.
data = {"columns": ['Name', 'City', 'DOB'],
"data":[
['ABC', 'Georgia', '01/05/1987'],
['ABC', 'Kansas', '10/11/1989']
]
}
# Converting to json string and then reading it into a pandas dataframe
import json
import pandas as pd
json_data = json.dumps(data)
pandasDF = pd.read_json(json_data, orient='split')
pandasDF
# Converting pandas dataframe to spark dataframe as below -
df = spark.createDataFrame(pandasDF)
df.show(truncate=False)
+----+-------+----------+
|Name| City| DOB|
+----+-------+----------+
| ABC|Georgia|01/05/1987|
| ABC| Kansas|10/11/1989|
+----+-------+----------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | DKNY |
