Normalise data where every second row is a column header

I have a system where you manually input data, for example data about people. Some fields are mandatory but the majority are optional. When the data is output it isn't normalised; instead it is structured so that the first row contains the field headers you've populated and the second row contains the data.

import pandas as pd
import numpy as np
data = [
    ['Name', 'Age', 'Eye_Colour', 'Height', 'Hair Colour', 'Kids', 'Job', np.nan, np.nan, np.nan, np.nan],
    ['John', '30', 'Brown', '130', 'Brown', 2, 'IT', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    ['Name', 'Age', 'Eye_Colour', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    ['Alex', 33, 'Blue', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    ['Name', 'Surname', 'Eye_Colour', 'Middlename', 'Nationality', 'Age', 'Height', 'Hair Colour', 'Kids', 'Job', 'Salary'],
    ['Mary', 'Smith', 'Green', 'Rose', 'UK', '50', 130, 'Black', 'None', 'Sales', 120000], 
    ['Name', 'Surname', 'Eye_Colour', 'Middlename', 'Nationality', 'Age','Salary', np.nan, np.nan, np.nan, np.nan],
    ['Sarah', 'Short', 'Green', 'Susie', 'Americann', 22, 25000, np.nan, np.nan, np.nan, np.nan]
]

joined_data = pd.DataFrame(data, columns=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])

I need to:

  1. Normalise the data entirely. This means the 5th row (the column headers for Mary) would become the column headers for the whole dataframe, and every other entry would line up under the correct heading.

  2. Separate the data into dataframes based on a matching variable, in this case Eye_Colour. Here that means three dataframes, one each for Brown, Blue, and Green. Each would have a different number of column headers; the dataframe for Blue would have only 3 columns. Eye_Colour is a mandatory field and is always in the same place, i.e. it is always returned by joined_data[3].

I have started on 2) by removing the final three columns, which contain no data in any row, with dropna(axis=1, how='all'), then filtering on Eye_Colour while also keeping the header row directly above each match:

joined_data = joined_data.dropna(axis=1, how='all')  # drop the columns that are empty for every row
joined_data_mask = (joined_data[3] == 'Green') | (joined_data[3].shift(-1) == 'Green')
joined_data[joined_data_mask]  # 'Green' rows plus the header row above each one

This returns the data for Green. My next thought was to calculate each row's length excluding NaNs (which may differ from person to person), group the rows with matching lengths, and separate those groups into different dataframes. I would then take the dataframe with the most column headers, use it as my template, and append all the others to it.
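
As a rough sketch of that counting step (assuming the all-empty columns have already been dropped as above), although I realise rows with the same count but different headers, such as John's and Sarah's, would land in the same group:

row_len = joined_data.notna().sum(axis=1)  # populated cells per row
grouped = {n: joined_data[row_len == n] for n in row_len.unique()}  # rows bucketed by that count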

I think I could use this solution to solve problem 1), but I am not sure.

The data I'm dealing with is large: about 300 columns and more than 150,000 rows. There are also about 300 distinct values of 'eye_colour', so efficiency is reasonably important. Any help appreciated.



Solution 1

I think separating the data by going through it two rows at a time is the right approach. I don't quite understand the bit about matching variables, though.

pd.concat will concatenate the per-person dataframes and align their columns by name (an outer join). You'll have to remove the NaN values while creating each dataframe:

pd.concat([
    pd.DataFrame(
        [[x for x in data[i + 1] if str(x) != 'nan']],    # the data row, NaNs removed
        columns=[x for x in data[i] if str(x) != 'nan'])  # the header row above it, NaNs removed
    for i in range(0, len(data), 2)  # rows come in header/data pairs
]).reset_index(drop=True)

Output:

    Name Age Eye_Colour Height Hair Colour  Kids    Job Surname Middlename Nationality    Salary
0   John  30      Brown    130       Brown     2     IT     NaN        NaN         NaN       NaN
1   Alex  33       Blue    NaN         NaN   NaN    NaN     NaN        NaN         NaN       NaN
2   Mary  50      Green    130       Black  None  Sales   Smith       Rose          UK  120000.0
3  Sarah  22      Green    NaN         NaN   NaN    NaN   Short      Susie   Americann   25000.0
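
For the second requirement (splitting on Eye_Colour), a groupby on the normalised frame should then do it. A minimal sketch, assuming the concatenated result above is assigned to a name such as df (df and frames are illustrative names, not from the original post): dropping the all-NaN columns inside each group leaves every frame with only the columns its rows actually populate, so the Blue frame keeps just 3 columns.

# df: the normalised frame produced by the pd.concat expression above
frames = {
    colour: group.dropna(axis=1, how='all').reset_index(drop=True)
    for colour, group in df.groupby('Eye_Colour')
}

frames['Blue']  # keeps only Name, Age and Eye_Colour

With around 300 distinct eye colours, a single groupby pass should also be cheaper than filtering the frame 300 times.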

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Solution 1: Tranbi (Stack Overflow)