'How to parse a pandas column of JSON content efficiently?

Let's say I have the following DataFrame, where the data column contains a nested JSON string that I want to parse into separate columns:

import pandas as pd

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)


df

I've found the json_normalize function, and am currently parsing the JSON in a list comprehension; the result is correct, but this takes long. 1000 rows take 1-2 seconds, and I have about a million rows in my real run:

import json
from pandas.io.json import json_normalize

parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])

parsed_df['bank_account'] = df['bank_account'].values
parsed_df.index = parsed_df['user_id']

parsed_df

Is there a faster way to parse this data into a nice-looking DataFrame?

Solution 1:^[1]

To also automatically get the column names for the dataframe use this:

parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']], columns=json_normalize(json.loads(js)).keys().tolist(), index=df.index)

Solution 2:^[2]

Assuming that the JSON data is available in one big chunk rather than split up into individual strings, then using json.loads, iterating through the results and creating dicts, and finally creating a DataFrame on the list of dicts works pretty well. With 150,000 rows, 30 original columns and 6 columns to be extracted into a new DataFrame, it completes in less than 1 second.

For example:

x = json.loads('[\
{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"},\
{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"},\
{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"},\
{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}]')
load_items = []
for item in x:
  load_items.append({
    'uid': item['uid'],
    'account_type': item['account_type'],
    'currency_current': item['account_data']['currency']['current'],
    'currency_minimum': item['account_data']['currency']['minimum'],
    'fees_monthly': item['account_data']['fees']['monthly'],
    'user_name': item['user_name'], 
  })
y = pd.DataFrame(load_items)
y

index	uid	account_type	currency_current	currency_minimum	fees_monthly	user_name
0	100	1	1000	-500	13.5	Alice
1	100	2	2000	0	0.0	Alice
2	200	1	3000	0	13.5	Bob
3	300	1	4000	0	13.5	Carol

(*thanks to Jupyter/Colab for the markdown table)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Oliver K
Solution 2	There

'How to parse a pandas column of JSON content efficiently?

Solution 1:[1]

Solution 2:[2]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]