'Parsing a JSON string which was loaded from a CSV using Pandas
I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:
name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"
After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?
After about an hour, the only thing I could come up with was:
import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))
This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.
Desired output is the dataframe object below. Added following lines of code to get there in my (crappy) way:
df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df
Out[14]:
name dob eye_color height weight
0 john smith 1/1/1980 brown 160 76
1 dave jones 2/2/1981 blue 170 85
2 bob roberts 3/3/1982 green 180 94
Solution 1:[1]
I think applying the json.load is a good idea, but from there you can simply directly convert it to dataframe columns instead of writing/loading it again:
stdf = df['stats'].apply(json.loads)
pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)
or alternatively in one step:
df.join(df['stats'].apply(json.loads).apply(pd.Series))
Solution 2:[2]
Option 1
If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})
Option 2
If you didn't then you might need to use this:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': eval})
Option 3
For more complicated situations you can write a custom converter like this:
import json
import pandas as pd
def parse_column(data):
try:
return json.loads(data)
except Exception as e:
print(e)
return None
df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
Solution 3:[3]
Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)
We can fix this by ensuring that the list of dict keys on the LHS is sorted. This works because the apply on the RHS automatically sorts by the index, which in this case is the list of column names.
def CustomParser(data):
import json
j1 = json.loads(data)
return j1
df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
Solution 4:[4]
json_normalize function in pandas.io.json package helps to do this without using custom function.
(assuming you are loading the data from a file)
from pandas.io.json import json_normalize
df = pd.read_csv(file_path, header=None)
stats_df = json_normalize(data['stats'].apply(ujson.loads).tolist())
stats_df.set_index(df.index, inplace=True)
df.join(stats_df)
del df.drop(df.columns[2], inplace=True)
Solution 5:[5]
- If you have DateTime values in your .csv file,
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)will mess up the date time values - This link has some tip how to read the csv file with json strings into the dataframe.
You could do the following to read csv file with json string column and convert your json string into columns.
Read your csv into the dataframe (
read_df)read_df = pd.read_csv('yourFile.csv', converters={'state':json.loads}, header=0, quotechar="'")Convert the json string column to a new dataframe
state_df = read_df['state'].apply(pd.Series)Merge the 2 dataframe with index number.
df = pd.merge(read_df, state_df, left_index=True, right_index=True)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | Pavan Yadiki |
| Solution 5 | Teana |
