Read nested JSON into Dask DataFrame
I am trying to read nested JSON into a Dask DataFrame, preferably with code that'll do the heavy lifting.
Here's the JSON file I am reading:
{
"data": [{
"name": "george",
"age": 16,
"exams": [{
"subject": "geometry",
"score": 56
},
{
"subject": "poetry",
"score": 88
}
]
}, {
"name": "nora",
"age": 7,
"exams": [{
"subject": "geometry",
"score": 87
},
{
"subject": "poetry",
"score": 94
}
]
}]
}
Here is the DataFrame I would like to end up with:
| name | age | exam_subject | exam_score |
|---|---|---|---|
| george | 16 | geometry | 56 |
| george | 16 | poetry | 88 |
| nora | 7 | geometry | 87 |
| nora | 7 | poetry | 94 |
Here's how I'd accomplish this with pandas:
import pandas as pd

df = pd.read_json("students3.json", orient="split")
exploded = df.explode("exams")
pd.concat(
    [
        exploded[["name", "age"]].reset_index(drop=True),
        pd.json_normalize(exploded["exams"]),
    ],
    axis=1,
)
Dask doesn't have json_normalize, so what's the best way to accomplish this task?
Solution 1:[1]
If the file contains JSON Lines (one JSON object per line), then the most scalable approach is to use dask.bag and map the pandas snippet across each bag partition.
If the file is one large JSON document, then the opening/closing brackets will cause problems, so an additional function is needed to strip them before parsing the text as JSON.
Rough pseudo-code (pandas_fn and convert_to_jsonlines are placeholders you would define yourself):
import json

import dask.bag as db

bag = db.read_text("students3.json")

# if the file contains json-lines
option1 = bag.map(json.loads).map(pandas_fn)

# if the file is a single json document
option2 = bag.map(convert_to_jsonlines).map(json.loads).map(pandas_fn)
Solution 2:[2]
Use pd.json_normalize
import json
import pandas as pd
with open('students3.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
df = pd.json_normalize(data['data'], record_path='exams', meta=['name', 'age'])
subject score name age
0 geometry 56 george 16
1 poetry 88 george 16
2 geometry 87 nora 7
3 poetry 94 nora 7
Solution 3:[3]
Pydantic offers excellent JSON validation and ingest. Several Pydantic models (one for each top-level JSON entry) can be converted to Python dictionaries in a loop to create a list of dictionaries (type: List[Dict]), which can then be converted to a DataFrame.
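A minimal sketch of that idea, assuming hypothetical Exam and Student models that mirror the JSON above (the data is inlined here to keep the example self-contained):

```python
from typing import List

import pandas as pd
from pydantic import BaseModel

# Hypothetical models mirroring the nested JSON structure
class Exam(BaseModel):
    subject: str
    score: int

class Student(BaseModel):
    name: str
    age: int
    exams: List[Exam]

raw = {
    "data": [
        {"name": "george", "age": 16,
         "exams": [{"subject": "geometry", "score": 56},
                   {"subject": "poetry", "score": 88}]},
        {"name": "nora", "age": 7,
         "exams": [{"subject": "geometry", "score": 87},
                   {"subject": "poetry", "score": 94}]},
    ]
}

# Validation happens here: bad types or missing fields raise an error
students = [Student(**record) for record in raw["data"]]

# Flatten each validated model into one row per exam
rows = [
    {"name": s.name, "age": s.age,
     "exam_subject": e.subject, "exam_score": e.score}
    for s in students
    for e in s.exams
]
df = pd.DataFrame(rows)
```

The upside over plain json.loads is that malformed records fail loudly at validation time rather than producing NaNs in the DataFrame.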
Solution 4:[4]
I was inspired by the other answers to come up with this solution.
ddf = dd.read_json("students3.json", orient="split")
def pandas_fn(df):
exploded = df.explode("exams")
return pd.concat(
[
exploded[["name", "age"]].reset_index(drop=True),
pd.json_normalize(exploded["exams"]),
],
axis=1,
)
res = ddf.map_partitions(
    pandas_fn,
    meta={
        "name": "object",
        "age": "int64",
        "subject": "object",
        "score": "int64",
    },
)
Running print(res.compute()) gives this output:
name age subject score
0 george 16 geometry 56
1 george 16 poetry 88
2 nora 7 geometry 87
3 nora 7 poetry 94
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SultanOrazbayev |
| Solution 2 | Ynjxsjmh |
| Solution 3 | Brent |
| Solution 4 | Powers |
