'pandas merge rows with overlapping column names using default non-NaN value
There is a list of tables and some configurations, specified in csv format.
I want to use default configuration from configs, and override it in tables if needed
configs = """
schema|new_name
dbo|{table}
qa|test_{table}
"""
tables = """
table|schema|new_name
employee|hr|{schema}_{table}
advertisers
users
"""
configs = pandas.read_csv(io.StringIO(configs), sep='|')
tables = pandas.read_csv(io.StringIO(tables), sep='|')
I want to cross-join/merge/concatenate/combine them to get a dataframe which contains:
final = """
table|schema|new_name
employee|hr|{schema}_{table}
advertisers|dbo|{table}
users|dbo|{table}
employee|hr|{schema}_{table}
advertisers|qa|test_{table}
users|qa|test_{table}
"""
If schema is not specified, use 'dbo' schema and 'users/advertisers' table name.
If schema is specified, use 'hr' schema and 'hr_employee' table name.
Basically - when horizontal concat of 2 rows with overlapping column names, create one column using whichever value is not NaN.
What pandas command should I use ?
Solution 1:[1]
EDIT:
#cross join both DataFrames
df = tables.merge(configs, suffixes=('','_'), how='cross')
#get columns with suffix _
cols = df.columns[df.columns.str.endswith('_')]
#remove suffix
new = cols.str.strip('_')
#replace missing values from cols by _ colums
df[new] = df[new].fillna(df[cols].rename(columns=lambda x: x.strip('_')))
#remove columns with _
df = df.drop(cols, axis=1)
print (df)
table schema new_name
0 employee hr {schema}_{table}
1 employee hr {schema}_{table}
2 advertisers dbo {table}
3 advertisers qa test_{table}
4 users dbo {table}
5 users qa test_{table}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
