'How to combine dataframes but combine matching values and marking when there's a match?
New to datascience, sorry about the confusing wording. I need to combine two dataframes that have similar data. If column "C Name", exists in both, then I don't want it duplicated but a checkmark placed in both "SW1 contain" and "SW2 contain". If a C Name is unique, its just added to the combined table with its respective checkmark.
These are 2 simplified samples of my tables. I've to combine 4 frames together
df_s = pd.DataFrame({'C Category': ["Hull","Hull","Hull","Mid High","Mid High","Mid High2"],
'C Name': ["Trunk","Roof","Ceiling","Floor","Plank","Fuel"],
'SW1 Contain': ["NAN","NAN","NAN","NAN","NAN","NAN"],
'SW1 Type': ["NAN","NAN","NAN","NAN","NAN","NAN"],
'SW2 Contain': ["√","√","√","√","√","√"],
'SW2 Type': ["NAN","NAN","NAN","NAN","NAN","NAN"]})
C Category C Name SW1 Contain SW1 Type SW2 Contain SW2 Type
0 Hull Trunk NAN NAN √ NAN
1 Hull Roof NAN NAN √ NAN
2 Hull Ceiling NAN NAN √ NAN
3 Mid High Floor NAN NAN √ NAN
4 Mid High Plank NAN NAN √ NAN
5 Mid High2 Fuel NAN NAN √ NAN
df_s2 = pd.DataFrame({'C Category': ["Hull","Hull","Lower","Lower2","Mid High","Mid High2"],
'C Name': ["Trunk","Roof","Tank","Axel","Floor","Fuel"],
'SW1 Contain': ["√","√","√","√","√","√"],
'SW1 Type': ["NAN","NAN","NAN","NAN","NAN","NAN"],
'SW2 Contain': ["NAN","NAN","NAN","NAN","NAN","NAN"],
'SW2 Type': ["NAN","NAN","NAN","NAN","NAN","NAN"]})
C Category C Name SW1 Contain SW1 Type SW2 Contain SW2 Type
0 Hull Trunk √ NAN NAN NAN
1 Hull Roof √ NAN NAN NAN
2 Lower Tank √ NAN NAN NAN
3 Lower2 Axel √ NAN NAN NAN
4 Mid High Floor √ NAN NAN NAN
5 Mid High2 Fuel √ NAN NAN NAN
I need it to look something like this(Order of C Category/Name doesn't matter)
C Category C Name SW1 Contain SW1 Type SW2 Contain SW2 Type
0 Hull Trunk √ NAN √ NAN
1 Hull Roof √ NAN √ NAN
2 Hull Ceiling NAN NAN √ NAN
3 Lower Tank √ NAN NAN NAN
4 Lower2 Axel √ NAN NAN NAN
5 Mid High Floor √ NAN √ NAN
6 Mid High Plank NAN NAN √ NAN
7 Mid High2 Fuel √ NAN √ NAN
I've tried pd.concat([df_s,df[~df.isin(df_s)].dropna()]) (df is a blank dataframe that I'm adding the 4 tables to 1 by 1). The idea is to check if something doesn't exist in df, it it doesn't add it. But it doesn't work after 1 table is added. If that's not the best way to do this please let me know.
I'm also thinking of iterating thorough the frames and doing if statements. Something like if a row isn't in the df, append it to df. But that will be extremely costly so please let me know of the best wat to approach this. I'm still learning so would like any advice.
Solution 1:[1]
One option is to replace the "NaN" strings to NaN values, concatenate then use groupby.first to get rid of duplicate "C Category" and "C Name" pairs:
out = (pd.concat([df_s.replace('NAN', float('nan')), df_s2.replace('NAN', float('nan'))])
.groupby(['C Category', 'C Name'], as_index=False, sort=False).first().fillna(float('nan')))
Output:
C Category C Name SW1 Contain SW1 Type SW2 Contain SW2 Type
0 Hull Trunk ? NaN ? NaN
1 Hull Roof ? NaN ? NaN
2 Hull Ceiling NaN NaN ? NaN
3 Mid High Floor ? NaN ? NaN
4 Mid High Plank NaN NaN ? NaN
5 Mid High2 Fuel ? NaN ? NaN
6 Lower Tank ? NaN NaN NaN
7 Lower2 Axel ? NaN NaN NaN
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
