'Pandas column containing JSON data wont split over several columns giving error json.decoder.JSONDecodeError
I have a very large csv file (40GB) that I am processing through pandas and JSON. The file has two columns, “id” and “data”. The second column contains json data that I now want to convert into columns. However, I am getting the error json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) when running this code:
import pandas as pd
import csv
import json
in_csv= "csv_in.csv"
out_csv= "csv_out.csv"
reader = pd.read_csv(in_csv, chunksize=800, sep='|', header=None, names=['id', 'data'], encoding='utf-8')
for chunk_df in reader:
chunk_df['data'] = chunk_df['data'].astype(str).apply(json.loads)
chunk_df.to_csv(out_csv, mode='a', sep='|', encoding='utf-8', header=None, index=False, na_rep="", quoting=csv.QUOTE_NONE)
My data looks like this:
id_1|{"iid":896,"groupId":"78","Name":"Jane Doe","Address":"Rue de Travaille"}
id_1|{"iid":897,"groupId":"78","Name":"John Doe","Address":"Rue de Travaille"}
id_2|
id_3|{"iid":905,"groupId":"30","Name":"James Smith","Address":"Rue de Paix"}
My goal is to make it look like this:
id_1|896|78|Jane Doe|Rue de Travaille
id_1|897|78|John Doe|Rue de Travaille
id_2| | | |
id_3|905|30|James Smith|Rue de Paix
Solution 1:[1]
You can try fill the NaN with empty dictionary string then use ast.literal_eval to covert json string to python dictionary. At last, use pd.json_normalize to convert dictionary column to multiple columns.
import ast
for chunk_df in reader:
chunk_df = pd.concat([chunk_df['id'], pd.json_normalize(chunk_df['data'].fillna('{}').apply(ast.literal_eval))], axis=1)
chunk_df.to_csv(out_csv, mode='a', sep='|', encoding='utf-8', header=None, index=False, na_rep="", quoting=csv.QUOTE_NONE)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ynjxsjmh |
