'How to efficiently normalize a yaml to pandas dataframe

I loaded the following yaml and I'm trying to parse it to a dataframe

data = {
    "clk": {
        "imgt": {
            "human": [
                "IGHV1-2*02",
                "IGKV1-33*01",
                "IGKJ3*01",
                "IGKJ4*01",
                "IGKJ4*02",
                "IGHJ2*01",
                "IGHJ3*02",
                "IGHJ5*02",
                "IGHD3-10*01",
                "IGHD3-16*02",
                "IGHD6-13*01",
                "IGKV1-5*03",
                "IGHJ4*02",
                "IGHD3-9*01",
                "IGLV2-11*01",
                "IGLJ1*01",
            ],
            "mouse": [
                "IGHV1-11*01",
                "IGHV1-12*01",
                "IGHV1-13*01",
                "IGHV1-14*01",
                "IGHV1-15*01",
                "IGHV1-16*01",
                "IGHV1-17-1*01",
                "IGHV1-18*01",
                "IGHV1-18*02",
                "IGHV1-18*03",
                "IGHV1-19*01",
                "IGLJ5*01",
            ],
        }
    }
}

def inefficient_normalizer(data):
    dataframe_loader = []
    for name in data:
        for source in data.get(name):
            for species in data.get(name).get(source):
                dataframe_loader.append(
                    {
                        "name": name,
                        "source": source,
                        "species": species,
                        "genes": data.get(name).get(source).get(species),
                    }
                )
    return pd.DataFrame(dataframe_loader).explode("genes").reset_index(drop=True)


>>>print(inefficient_normalizer)
name source species          genes
0   clk   imgt   human     IGHV1-2*02
1   clk   imgt   human    IGKV1-33*01
2   clk   imgt   human       IGKJ3*01
3   clk   imgt   human       IGKJ4*01
4   clk   imgt   human       IGKJ4*02
5   clk   imgt   human       IGHJ2*01
6   clk   imgt   human       IGHJ3*02
7   clk   imgt   human       IGHJ5*02
...

However, I'm wondering how I can do this with the builtin json_normalize like I see here. The difference is that I only have a key for each level and not a key pair. Right now I get

import pandas as pd
pd.json_normalize(data)
                                      clk.imgt.human                                     clk.imgt.mouse
0  [IGHV1-2*02, IGKV1-33*01, IGKJ3*01, IGKJ4*01, ...  [IGHV1-11*01, IGHV1-12*01, IGHV1-13*01, IGHV1-...

Any tips? I won't know the column names ahead of time.



Solution 1:[1]

You could convert the data from json_normalize like this:

import pandas as pd
data = {
    "clk": {
        "imgt": {
            "human": [
                "IGHV1-2*02",
                "IGKV1-33*01",
                "IGKJ3*01",
                "IGKJ4*01",
                "IGKJ4*02",
                "IGHJ2*01",
                "IGHJ3*02",
                "IGHJ5*02",
                "IGHD3-10*01",
                "IGHD3-16*02",
                "IGHD6-13*01",
                "IGKV1-5*03",
                "IGHJ4*02",
                "IGHD3-9*01",
                "IGLV2-11*01",
                "IGLJ1*01",
            ],
            "mouse": [
                "IGHV1-11*01",
                "IGHV1-12*01",
                "IGHV1-13*01",
                "IGHV1-14*01",
                "IGHV1-15*01",
                "IGHV1-16*01",
                "IGHV1-17-1*01",
                "IGHV1-18*01",
                "IGHV1-18*02",
                "IGHV1-18*03",
                "IGHV1-19*01",
                "IGLJ5*01",
            ],
        }
    }
}
df = pd.json_normalize(data)
df = df.T
df.index = df.index.str.split(".").map(tuple)
df = df.reset_index().explode(0)

I transpose the table to make the columns into rows, then I split the column names at the dot and create tuples from the resulting lists, turning it into a multilevel index. Resetting the index and using explode to create one line per entry then does the trick.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1