'Convert deeply nested JSON to Pandas dataframe

My data in JSON format:

[
    {
        "UNIT": "UNIT1",
        "PROJECTS": [
            {
                "PROJECT": "A",
                "PERIODS": [
                    {
                        "PERIOD": "2019",
                        "TEAMS": [
                            {
                                "TEAM": "Team A",
                                "MEMBERS": [
                                    {
                                        "NAME": "FANNY",
                                        "ID": 111
                                    },
                                    {
                                        "NAME": "TANG",
                                        "ID": 222
                                    }
                                ]
                            },
                            {
                                "TEAM": "Team B",
                                "MEMBERS": [
                                    {
                                        "NAME": "TIM",
                                        "ID": 444
                                    },
                                    {
                                        "NAME": "PAUL",
                                        "ID": 555
                                    }
                                ]
                            }
                        ]
                    }
                ]
            },
            {
                "PROJECT": "B",
                "PERIODS": [
                    {
                        "PERIOD": "2021",
                        "TEAMS": [
                            {
                                "TEAM": "Team A",
                                "MEMBERS": [
                                    {
                                        "NAME": "BENNY",
                                        "ID": 121
                                    },
                                    {
                                        "NAME": "JENNY",
                                        "ID": 122
                                    }
                                ]
                            },
                            {
                                "TEAM": "Team B",
                                "MEMBERS": [
                                    {
                                        "NAME": "CHRIS",
                                        "ID": 123
                                    },
                                    {
                                        "NAME": "TANG",
                                        "ID": 124
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
]

Expected output dataframe

    UNIT PROJECT PERIOD   NAME   ID
0  UNIT1       A   2019  FANNY  111
1  UNIT1       A   2019   TANG  222
2  UNIT1       A   2019    TIM  444
3  UNIT1       A   2019   PAUL  555
4  UNIT1       B   2021  BENNY  121
5  UNIT1       B   2021  JENNY  122
6  UNIT1       B   2021  CHRIS  123
7  UNIT1       B   2021   TANG  124

I would like to store the data as formatted in the above JSON. In the future, the data structure could be big, so in order to make it nested, I choose the above way to store the data. However, I found it could be difficult to convert it back to dataframe.

The above JSON is deeply nested, I have tried pd.json_normalize but I cannot achieve the expected output.



Solution 1:[1]

You can use the Pandas library json_normalize function with some parameters.

It should like something like this

df = pd.json_normalize(
    name_of_the_file,
    meta=[
        'unit',
        ['unit', 'projects', 'project'],
        ['unit', 'projects', 'periods', 'period'],
        ['unit', 'projects', 'periods', 'teams', 'members', 'name'],
        ['unit', 'projects', 'periods', 'teams', 'members', 'id']
    ]
)

Into the meta parameter you should write the json path of each field which you want to show in your pandas dataframe.

Solution 2:[2]

You can do this and the advantage of this solution is that you never need to care about the paths in your json.


data = [
    {
        "UNIT": "UNIT1",
        "PROJECTS": [
            {
                "PROJECT": "A",
                "PERIODS": [
                    {
                        "PERIOD": "2019",
                        "TEAMS": [
                            {
                                "TEAM": "Team A",
                                "MEMBERS": [
                                    {
                                        "NAME": "FANNY",
                                        "ID": 111
                                    },
                                    {
                                        "NAME": "TANG",
                                        "ID": 222
                                    }
                                ]
                            },
                            {
                                "TEAM": "Team B",
                                "MEMBERS": [
                                    {
                                        "NAME": "TIM",
                                        "ID": 444
                                    },
                                    {
                                        "NAME": "PAUL",
                                        "ID": 555
                                    }
                                ]
                            }
                        ]
                    }
                ]
            },
            {
                "PROJECT": "B",
                "PERIODS": [
                    {
                        "PERIOD": "2021",
                        "TEAMS": [
                            {
                                "TEAM": "Team A",
                                "MEMBERS": [
                                    {
                                        "NAME": "BENNY",
                                        "ID": 121
                                    },
                                    {
                                        "NAME": "JENNY",
                                        "ID": 122
                                    }
                                ]
                            },
                            {
                                "TEAM": "Team B",
                                "MEMBERS": [
                                    {
                                        "NAME": "CHRIS",
                                        "ID": 123
                                    },
                                    {
                                        "NAME": "TANG",
                                        "ID": 124
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
]

Defined the following function (it works on any json):

import json
import pandas as pd
def flatten_nested_json_df(df):
    df = df.reset_index()
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()
    
    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            #print(f"exploding: {col}")
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
    return df

and then do the following:

jsonString = json.dumps(data)  #What you gave is not really a json string but a list
jsonstr = json.loads(jsonString)
results = pd.json_normalize(jsonstr)
df = pd.DataFrame(results)

outdf = flatten_nested_json_df(df)

which returns:

   index   UNIT PROJECTS.PROJECT PROJECTS.PERIODS.PERIOD  \
0       0  UNIT1                A                    2019   
0       0  UNIT1                A                    2019   
0       0  UNIT1                A                    2019   
0       0  UNIT1                A                    2019   
0       0  UNIT1                A                    2019   
..    ...    ...              ...                     ...   
0       0  UNIT1                B                    2021   
0       0  UNIT1                B                    2021   
0       0  UNIT1                B                    2021   
0       0  UNIT1                B                    2021   
0       0  UNIT1                B                    2021   

   PROJECTS.PERIODS.TEAMS.TEAM PROJECTS.PERIODS.TEAMS.MEMBERS.NAME  \
0                       Team A                               FANNY   
0                       Team A                                TANG   
0                       Team A                                 TIM   
0                       Team A                                PAUL   
0                       Team A                               BENNY   
..                         ...                                 ...   
0                       Team B                                PAUL   
0                       Team B                               BENNY   
0                       Team B                               JENNY   
0                       Team B                               CHRIS   
0                       Team B                                TANG   

    PROJECTS.PERIODS.TEAMS.MEMBERS.ID  
0                                 111  
0                                 222  
0                                 444  
0                                 555  
0                                 121  
..                                ...  
0                                 555  
0                                 121  
0                                 122  
0                                 123  
0                                 124  

[2048 rows x 7 columns]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 LMark
Solution 2