'Convert deeply nested JSON to Pandas dataframe
My data in JSON format:
[
{
"UNIT": "UNIT1",
"PROJECTS": [
{
"PROJECT": "A",
"PERIODS": [
{
"PERIOD": "2019",
"TEAMS": [
{
"TEAM": "Team A",
"MEMBERS": [
{
"NAME": "FANNY",
"ID": 111
},
{
"NAME": "TANG",
"ID": 222
}
]
},
{
"TEAM": "Team B",
"MEMBERS": [
{
"NAME": "TIM",
"ID": 444
},
{
"NAME": "PAUL",
"ID": 555
}
]
}
]
}
]
},
{
"PROJECT": "B",
"PERIODS": [
{
"PERIOD": "2021",
"TEAMS": [
{
"TEAM": "Team A",
"MEMBERS": [
{
"NAME": "BENNY",
"ID": 121
},
{
"NAME": "JENNY",
"ID": 122
}
]
},
{
"TEAM": "Team B",
"MEMBERS": [
{
"NAME": "CHRIS",
"ID": 123
},
{
"NAME": "TANG",
"ID": 124
}
]
}
]
}
]
}
]
}
]
Expected output dataframe
UNIT PROJECT PERIOD NAME ID
0 UNIT1 A 2019 FANNY 111
1 UNIT1 A 2019 TANG 222
2 UNIT1 A 2019 TIM 444
3 UNIT1 A 2019 PAUL 555
4 UNIT1 B 2021 BENNY 121
5 UNIT1 B 2021 JENNY 122
6 UNIT1 B 2021 CHRIS 123
7 UNIT1 B 2021 TANG 124
I would like to store the data as formatted in the above JSON. In the future, the data structure could be big, so in order to make it nested, I choose the above way to store the data. However, I found it could be difficult to convert it back to dataframe.
The above JSON is deeply nested, I have tried pd.json_normalize but I cannot achieve the expected output.
Solution 1:[1]
You can use the Pandas library json_normalize function with some parameters.
It should like something like this
df = pd.json_normalize(
name_of_the_file,
meta=[
'unit',
['unit', 'projects', 'project'],
['unit', 'projects', 'periods', 'period'],
['unit', 'projects', 'periods', 'teams', 'members', 'name'],
['unit', 'projects', 'periods', 'teams', 'members', 'id']
]
)
Into the meta parameter you should write the json path of each field which you want to show in your pandas dataframe.
Solution 2:[2]
You can do this and the advantage of this solution is that you never need to care about the paths in your json.
data = [
{
"UNIT": "UNIT1",
"PROJECTS": [
{
"PROJECT": "A",
"PERIODS": [
{
"PERIOD": "2019",
"TEAMS": [
{
"TEAM": "Team A",
"MEMBERS": [
{
"NAME": "FANNY",
"ID": 111
},
{
"NAME": "TANG",
"ID": 222
}
]
},
{
"TEAM": "Team B",
"MEMBERS": [
{
"NAME": "TIM",
"ID": 444
},
{
"NAME": "PAUL",
"ID": 555
}
]
}
]
}
]
},
{
"PROJECT": "B",
"PERIODS": [
{
"PERIOD": "2021",
"TEAMS": [
{
"TEAM": "Team A",
"MEMBERS": [
{
"NAME": "BENNY",
"ID": 121
},
{
"NAME": "JENNY",
"ID": 122
}
]
},
{
"TEAM": "Team B",
"MEMBERS": [
{
"NAME": "CHRIS",
"ID": 123
},
{
"NAME": "TANG",
"ID": 124
}
]
}
]
}
]
}
]
}
]
Defined the following function (it works on any json):
import json
import pandas as pd
def flatten_nested_json_df(df):
df = df.reset_index()
s = (df.applymap(type) == list).all()
list_columns = s[s].index.tolist()
s = (df.applymap(type) == dict).all()
dict_columns = s[s].index.tolist()
while len(list_columns) > 0 or len(dict_columns) > 0:
new_columns = []
for col in dict_columns:
horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
horiz_exploded.index = df.index
df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
new_columns.extend(horiz_exploded.columns) # inplace
for col in list_columns:
#print(f"exploding: {col}")
df = df.drop(columns=[col]).join(df[col].explode().to_frame())
new_columns.append(col)
s = (df[new_columns].applymap(type) == list).all()
list_columns = s[s].index.tolist()
s = (df[new_columns].applymap(type) == dict).all()
dict_columns = s[s].index.tolist()
return df
and then do the following:
jsonString = json.dumps(data) #What you gave is not really a json string but a list
jsonstr = json.loads(jsonString)
results = pd.json_normalize(jsonstr)
df = pd.DataFrame(results)
outdf = flatten_nested_json_df(df)
which returns:
index UNIT PROJECTS.PROJECT PROJECTS.PERIODS.PERIOD \
0 0 UNIT1 A 2019
0 0 UNIT1 A 2019
0 0 UNIT1 A 2019
0 0 UNIT1 A 2019
0 0 UNIT1 A 2019
.. ... ... ... ...
0 0 UNIT1 B 2021
0 0 UNIT1 B 2021
0 0 UNIT1 B 2021
0 0 UNIT1 B 2021
0 0 UNIT1 B 2021
PROJECTS.PERIODS.TEAMS.TEAM PROJECTS.PERIODS.TEAMS.MEMBERS.NAME \
0 Team A FANNY
0 Team A TANG
0 Team A TIM
0 Team A PAUL
0 Team A BENNY
.. ... ...
0 Team B PAUL
0 Team B BENNY
0 Team B JENNY
0 Team B CHRIS
0 Team B TANG
PROJECTS.PERIODS.TEAMS.MEMBERS.ID
0 111
0 222
0 444
0 555
0 121
.. ...
0 555
0 121
0 122
0 123
0 124
[2048 rows x 7 columns]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | LMark |
| Solution 2 |
