How to parse a pandas column of JSON content efficiently?

Let's say I have the following DataFrame, where the data column contains a nested JSON string that I want to parse into separate columns:

import pandas as pd

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)


df

I've found the json_normalize function, and I'm currently parsing the JSON in a list comprehension. The result is correct, but it is slow: 1,000 rows take 1-2 seconds, and my real run has about a million rows:

import json
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])

parsed_df['bank_account'] = df['bank_account'].values
parsed_df.index = parsed_df['uid']

parsed_df

Is there a faster way to parse this data into a nice-looking DataFrame?



Solution 1:[1]

To also get the column names for the dataframe automatically, derive them once from the first row:

columns = json_normalize(json.loads(df['data'].iloc[0])).columns
parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']],
                         columns=columns, index=df.index)

Solution 2:[2]

Assuming the JSON data is available as one big chunk rather than split into individual strings, you can call json.loads once, iterate through the result building flat dicts, and finally create a DataFrame from the list of dicts. With 150,000 rows, 30 original columns, and 6 columns extracted into a new DataFrame, this completes in less than a second.

For example:

x = json.loads('''[
{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"},
{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"},
{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"},
{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}]''')
load_items = []
for item in x:
  load_items.append({
    'uid': item['uid'],
    'account_type': item['account_type'],
    'currency_current': item['account_data']['currency']['current'],
    'currency_minimum': item['account_data']['currency']['minimum'],
    'fees_monthly': item['account_data']['fees']['monthly'],
    'user_name': item['user_name'], 
  })
y = pd.DataFrame(load_items)
y
| index | uid | account_type | currency_current | currency_minimum | fees_monthly | user_name |
|-------|-----|--------------|------------------|------------------|--------------|-----------|
| 0     | 100 | 1            | 1000             | -500             | 13.5         | Alice     |
| 1     | 100 | 2            | 2000             | 0                | 0.0          | Alice     |
| 2     | 200 | 1            | 3000             | 0                | 13.5         | Bob       |
| 3     | 300 | 1            | 4000             | 0                | 13.5         | Carol     |

(*thanks to Jupyter/Colab for the markdown table)
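The same idea can also be applied to the per-row strings from the original question. As a sketch (assuming pandas >= 1.0, which exposes pd.json_normalize at the top level): parse each string once with json.loads, then flatten the whole list in a single json_normalize call instead of concatenating one tiny DataFrame per row:

```python
import json

import pandas as pd

# Two sample rows in the same shape as the question's DataFrame
df = pd.DataFrame({
    'bank_account': [101, 102],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"}',
    ]})

# Parse every JSON string once, then flatten all records in one call;
# nested keys become dotted column names such as 'account_data.currency.current'.
records = df['data'].map(json.loads).tolist()
parsed = pd.json_normalize(records)
parsed['bank_account'] = df['bank_account'].values
```

A single json_normalize call over all records avoids the per-row cost of building and concatenating many one-row DataFrames, which is what dominates the runtime in the list-comprehension approach.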

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Oliver K
Solution 2: There