'How to efficiently normalize a yaml to pandas dataframe
I loaded the following yaml and I'm trying to parse it to a dataframe
data = {
"clk": {
"imgt": {
"human": [
"IGHV1-2*02",
"IGKV1-33*01",
"IGKJ3*01",
"IGKJ4*01",
"IGKJ4*02",
"IGHJ2*01",
"IGHJ3*02",
"IGHJ5*02",
"IGHD3-10*01",
"IGHD3-16*02",
"IGHD6-13*01",
"IGKV1-5*03",
"IGHJ4*02",
"IGHD3-9*01",
"IGLV2-11*01",
"IGLJ1*01",
],
"mouse": [
"IGHV1-11*01",
"IGHV1-12*01",
"IGHV1-13*01",
"IGHV1-14*01",
"IGHV1-15*01",
"IGHV1-16*01",
"IGHV1-17-1*01",
"IGHV1-18*01",
"IGHV1-18*02",
"IGHV1-18*03",
"IGHV1-19*01",
"IGLJ5*01",
],
}
}
}
def inefficient_normalizer(data):
dataframe_loader = []
for name in data:
for source in data.get(name):
for species in data.get(name).get(source):
dataframe_loader.append(
{
"name": name,
"source": source,
"species": species,
"genes": data.get(name).get(source).get(species),
}
)
return pd.DataFrame(dataframe_loader).explode("genes").reset_index(drop=True)
>>>print(inefficient_normalizer)
name source species genes
0 clk imgt human IGHV1-2*02
1 clk imgt human IGKV1-33*01
2 clk imgt human IGKJ3*01
3 clk imgt human IGKJ4*01
4 clk imgt human IGKJ4*02
5 clk imgt human IGHJ2*01
6 clk imgt human IGHJ3*02
7 clk imgt human IGHJ5*02
...
However, I'm wondering how I can do this with the builtin json_normalize like I see here. The difference is that I only have a key for each level and not a key pair. Right now I get
import pandas as pd
pd.json_normalize(data)
clk.imgt.human clk.imgt.mouse
0 [IGHV1-2*02, IGKV1-33*01, IGKJ3*01, IGKJ4*01, ... [IGHV1-11*01, IGHV1-12*01, IGHV1-13*01, IGHV1-...
Any tips? I won't know the column names ahead of time.
Solution 1:[1]
You could convert the data from json_normalize like this:
import pandas as pd
data = {
"clk": {
"imgt": {
"human": [
"IGHV1-2*02",
"IGKV1-33*01",
"IGKJ3*01",
"IGKJ4*01",
"IGKJ4*02",
"IGHJ2*01",
"IGHJ3*02",
"IGHJ5*02",
"IGHD3-10*01",
"IGHD3-16*02",
"IGHD6-13*01",
"IGKV1-5*03",
"IGHJ4*02",
"IGHD3-9*01",
"IGLV2-11*01",
"IGLJ1*01",
],
"mouse": [
"IGHV1-11*01",
"IGHV1-12*01",
"IGHV1-13*01",
"IGHV1-14*01",
"IGHV1-15*01",
"IGHV1-16*01",
"IGHV1-17-1*01",
"IGHV1-18*01",
"IGHV1-18*02",
"IGHV1-18*03",
"IGHV1-19*01",
"IGLJ5*01",
],
}
}
}
df = pd.json_normalize(data)
df = df.T
df.index = df.index.str.split(".").map(tuple)
df = df.reset_index().explode(0)
I transpose the table to make the columns into rows, then I split the column names at the dot and create tuples from the resulting lists, turning it into a multilevel index. Resetting the index and using explode to create one line per entry then does the trick.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
