Normalise data where every second row is a column header
I have a system where you manually input data, for example data about people. Some fields are mandatory, but the majority are optional. When the data is output it isn't normalised; instead, for each record the first row contains the field headers that were populated and the second row contains the data.
import pandas as pd
import numpy as np
data = [
['Name', 'Age', 'Eye_Colour', 'Height', 'Hair Colour', 'Kids', 'Job', np.nan, np.nan, np.nan, np.nan],
['John', '30', 'Brown', '130', 'Brown', 2, 'IT', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
['Name', 'Age', 'Eye_Colour', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
['Alex', 33, 'Blue', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
['Name', 'Surname', 'Eye_Colour', 'Middlename', 'Nationality', 'Age', 'Height', 'Hair Colour', 'Kids', 'Job', 'Salary'],
['Mary', 'Smith', 'Green', 'Rose', 'UK', '50', 130, 'Black', 'None', 'Sales', 120000],
['Name', 'Surname', 'Eye_Colour', 'Middlename', 'Nationality', 'Age','Salary', np.nan, np.nan, np.nan, np.nan],
['Sarah', 'Short', 'Green', 'Susie', 'Americann', 22, 25000, np.nan, np.nan, np.nan, np.nan]
]
joined_data = pd.DataFrame(data, columns=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
I need to:

1. Normalise the data entirely. This would mean the 5th row, i.e. the column headers for Mary, becoming the column headers for the whole dataframe, with the other entries lined up under the correct headings.
2. Separate the data into dataframes based on a matching variable, in this case `Eye_Colour`. Here that means 3 dataframes, one each for `Brown`, `Blue`, and `Green`. Each would have a different number of columns; the dataframe for `Blue` would have only 3. `Eye_Colour` is a mandatory field and is always in the same place, i.e. it can always be retrieved with `joined_data[3]`.
I have started on 2) by removing the final 3 columns with `dropna(axis=1, how='all')`, since they contain no data in any row, then filtering on `Eye_Colour` while also taking the header row above each match:
# keep the rows where Eye_Colour is Green, plus the header row directly above each
joined_data_mask = (joined_data[3] == 'Green') | (joined_data[3].shift(-1) == 'Green')
joined_data[joined_data_mask]
This returns the data for Green. My next thought was to calculate each row's length excluding NaNs (which may differ from person to person), group the rows with matching lengths, and separate those into different dataframes. I'd then take the dataframe with the most column headers, use that as my template, and append all the others to it.
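For illustration, a minimal sketch of that row-length idea, assuming the `joined_data` frame built above (note that two different header sets could share the same length, so the count alone may not be a safe grouping key):

row_lengths = joined_data.notna().sum(axis=1)  # populated cells per row
for length, group in joined_data.groupby(row_lengths):
    # each bucket holds the header/data rows of the same width
    print(f"{len(group)} rows with {length} populated columns")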
I think I could use this solution to solve problem 1), but I am not sure.
The data I'm dealing with is large: about 300 columns and more than 150,000 rows. There are also about 300 distinct values of `Eye_Colour`, so efficiency is reasonably important. Any help appreciated.
Solution 1:
I think separating the dataframes by going through the data two rows at a time is the right approach. I don't understand the bit about matching variables though.
`pd.concat` will concatenate the dataframes and align the columns. You'll have to remove the NaN values while creating the dataframes:
pd.concat([
    # one single-row dataframe per record: row i holds the headers,
    # row i+1 the values; filter out the NaN padding from both
    pd.DataFrame(
        [[x for x in data[i + 1] if str(x) != 'nan']],
        columns=[x for x in data[i] if str(x) != 'nan'])
    for i in range(0, len(data), 2)
]).reset_index(drop=True)
Output:
Name Age Eye_Colour Height Hair Colour Kids Job Surname Middlename Nationality Salary
0 John 30 Brown 130 Brown 2 IT NaN NaN NaN NaN
1 Alex 33 Blue NaN NaN NaN NaN NaN NaN NaN NaN
2 Mary 50 Green 130 Black None Sales Smith Rose UK 120000.0
3 Sarah 22 Green NaN NaN NaN NaN Short Susie Americann 25000.0
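That covers 1). For 2), a sketch of one way to split the normalised result, assuming it is stored in `df`; the `groupby`/`dropna` step here is a suggestion, not something from the question's own attempt:

df = pd.concat([
    pd.DataFrame(
        [[x for x in data[i + 1] if str(x) != 'nan']],
        columns=[x for x in data[i] if str(x) != 'nan'])
    for i in range(0, len(data), 2)
]).reset_index(drop=True)

# one dataframe per eye colour, keeping only the columns that hold
# data within that group; 'Blue' ends up with just Name, Age, Eye_Colour
frames = {colour: group.dropna(axis=1, how='all')
          for colour, group in df.groupby('Eye_Colour')}

With roughly 300 distinct colours this builds all the frames in a single pass over the normalised data, which should be workable at the sizes mentioned in the question.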
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tranbi |
