'Pandas flattening nested jsons

so this is probably going to be a duplicate question but i'll make a try since I have not found anything.

I am trying to flatten a json with pandas, normal work. Looking at the examples of the docs here is the closest example for what I am trying to do:

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
                                           ['info', 'governor']])
result
         name  population    state shortname info.governor
0        Dade       12345   Florida    FL    Rick Scott
1     Broward       40000   Florida    FL    Rick Scott
2  Palm Beach       60000   Florida    FL    Rick Scott
3      Summit        1234   Ohio       OH    John Kasich
4    Cuyahoga        1337   Ohio       OH    John Kasich

However, this example show us a way to get the data inside counties flatten with alongside the column state and shortname. Let's say that I have n number of columns at the root of each json object ( n number of state or shortname columns in the example above ). How do I include them all, in order to flat the counties but keep everything else that is adjacent?

First I tried things like these:

#None to treat data as a list of records
#Result of counties is still nested, not working
result = pd.json_normalize(data, None, ['counties'])

or


result = pd.json_normalize(data, None, ['counties', 'name'])

Then I Thought of getting the columns with dataframe.columns and reuse it since meta argument of json_normalize can take array of string.

But i'm stuck. and columns appear to return nested json attribute which I don't want to.

#still nested
cols = pd.json_normalize(data).columns.to_list()
#Exclude it because we already have  it 
cols = [index for index in cols if index != 'counties']
#remove nested columns if any
cols = [index for index in cols if "." not in index]

result = pd.json_normalize(data, 'counties', cols, errors="ignore")
#still nested

         name  population    state shortname  ...     other6     other7                                           counties info.governor
0        Dade       12345  Florida        FL  ...  dumb_data  dumb_data  [{'name': 'Dade', 'population': 12345}, {'name...           NaN
1     Broward       40000  Florida        FL  ...  dumb_data  dumb_data  [{'name': 'Dade', 'population': 12345}, {'name...           NaN
2  Palm Beach       60000  Florida        FL  ...  dumb_data  dumb_data  [{'name': 'Dade', 'population': 12345}, {'name...           NaN
3      Summit        1234     Ohio        OH  ...  dumb_data  dumb_data  [{'name': 'Summit', 'population': 1234}, {'nam...           NaN
4    Cuyahoga        1337     Ohio        OH  ...  dumb_data  dumb_data  [{'name': 'Summit', 'population': 1234}, {'nam...           NaN

I would prefere not to just harcode the column names since they change and that for this case I have 64 of them...

For better understanding, this is the real kind of data i'm working on from Woo Rest API. I am not using it here because its really long, but basically I am trying to flat line_items keeping only product_id inside it and of course all the other columns which is adjacent to line_items.



Solution 1:[1]

Okay so guys if you want to flatten a json and keeping everything else, you should used pd.Dataframe.explode()

Here is my logic:

import pandas as pd

data = [
        {'state': 'Florida',
        'shortname': 'FL',
        'info': {'governor': 'Rick Scott'},
        'counties': [
                      {'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}
        ]
        },
        {'state': 'Ohio',
        'shortname': 'OH',
        'info': {'governor': 'John Kasich'},
        'counties': [{'name': 'Summit', 'population': 1234},
        {'name': 'Cuyahoga', 'population': 1337}]}
]
#No Formating only converting to a Df
result = pd.json_normalize(data)

#Exploding the wanted nested column
exploded = result.explode('counties')

#Keeping the name only - this can be custom
exploded['countie_name'] = exploded['counties'].apply(lambda x: x['name'])

#Drop the used column since we took what interested us inside it.
exploded = exploded.drop(['counties'], axis=1)

print(exploded)
#Duplicate for Florida, as wanted with diferent countie names
     state shortname info.governor countie_name
0  Florida        FL    Rick Scott         Dade
0  Florida        FL    Rick Scott      Broward
0  Florida        FL    Rick Scott   Palm Beach
1     Ohio        OH   John Kasich       Summit
1     Ohio        OH   John Kasich     Cuyahoga

Imagine you have the content of a basket of product as a nested json, to explode the content of the basket while keeping the general basket attributes, you can do this.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Justin