'How to parse a pandas column of JSON content efficiently?
Let's say I have the following DataFrame, where the data column contains a nested JSON string that I want to parse into separate columns:
import pandas as pd
df = pd.DataFrame({
'bank_account': [101, 102, 201, 301],
'data': [
'{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
'{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"}',
'{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"}',
'{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}'
]},
index = ['Alice', 'Alice', 'Bob', 'Carol']
)
df
I've found the json_normalize function, and am currently parsing the JSON in a list comprehension; the result is correct, but this takes long. 1000 rows take 1-2 seconds, and I have about a million rows in my real run:
import json
from pandas.io.json import json_normalize
parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])
parsed_df['bank_account'] = df['bank_account'].values
parsed_df.index = parsed_df['user_id']
parsed_df
Is there a faster way to parse this data into a nice-looking DataFrame?
Solution 1:[1]
To also automatically get the column names for the dataframe use this:
parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']], columns=json_normalize(json.loads(js)).keys().tolist(), index=df.index)
Solution 2:[2]
Assuming that the JSON data is available in one big chunk rather than split up into individual strings, then using json.loads, iterating through the results and creating dicts, and finally creating a DataFrame on the list of dicts works pretty well. With 150,000 rows, 30 original columns and 6 columns to be extracted into a new DataFrame, it completes in less than 1 second.
For example:
x = json.loads('[\
{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"},\
{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"},\
{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"},\
{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}]')
load_items = []
for item in x:
load_items.append({
'uid': item['uid'],
'account_type': item['account_type'],
'currency_current': item['account_data']['currency']['current'],
'currency_minimum': item['account_data']['currency']['minimum'],
'fees_monthly': item['account_data']['fees']['monthly'],
'user_name': item['user_name'],
})
y = pd.DataFrame(load_items)
y
| index | uid | account_type | currency_current | currency_minimum | fees_monthly | user_name |
|---|---|---|---|---|---|---|
| 0 | 100 | 1 | 1000 | -500 | 13.5 | Alice |
| 1 | 100 | 2 | 2000 | 0 | 0.0 | Alice |
| 2 | 200 | 1 | 3000 | 0 | 13.5 | Bob |
| 3 | 300 | 1 | 4000 | 0 | 13.5 | Carol |
(*thanks to Jupyter/Colab for the markdown table)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Oliver K |
| Solution 2 | There |
