'Pandas: Find difference in rows with same index in any column
Sample dataframe:
In [1898]: df = pd.DataFrame({'index':[0,0,5,5,6,6,8,8], 'table_name':['f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person'], 'column_name':['active', 'actv', 'ssn', 'ssn', 'pl', '
...: pl', 'prefix', 'prefix'], 'data_type':['integer', 'integer', 'varchar', 'varchar', 'varchar', 'varchar', 'varchar', 'integer'], 'default':[np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan, np.nan], 'max_length
...: ':[np.nan, np.nan, 256, 99, 256, 256, 256, 256]})
In [1899]: df = pd.DataFrame({'index':[0,0,5,5,6,6,8,8], 'table_name':['f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person'], 'column_name':['active', 'actv', 'ssn', 'ssn', 'pl', '
...: pl', 'prefix', 'prefix'], 'data_type':['integer', 'integer', 'varchar', 'varchar', 'varchar', 'bigint', 'varchar', 'integer'], 'default':[np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan, np.nan], 'max_length'
...: :[np.nan, np.nan, 256, 99, 256, 256, 256, 256]})
In [1900]: df = df.set_index('index')
In [1901]: df
Out[1901]:
table_name column_name data_type default max_length
index
0 f_person active integer NaN NaN
0 f_person actv integer NaN NaN
5 f_person ssn varchar NaN 256.0
5 f_person ssn varchar NaN 99.0
6 f_person pl varchar 10.0 256.0
6 f_person pl bigint NaN 256.0
8 f_person prefix varchar NaN 256.0
8 f_person prefix integer NaN 256.0
If you see here, the rows with common index have atleast one difference amongst them.
For ex:
Rows with index 0, have difference in column_name.
Rows with index 5, have difference in max_length.
Rows with index 6, have differences in both data_type and default.
Rows with index 8, have difference in data_type.
Expected Output:
[
{
0: {'column_name': ['active', 'actv']},
5: {'max_length': [256, 99]},
6: {'data_type': ['varchar', 'bigint'], 'default': [10, np.nan]},
8: {'data_type': ['varchar', 'integer']}
}
]
This is part of a bigger problem. I've kind of solved it till here. Not sure how to proceed further. Any ideas?
Solution 1:[1]
Don't set the index of df and then just run this:
output = [(
df.groupby('index')
.apply(lambda data: {col: data[col].unique().tolist()
for col in data.columns
if len(data[col].unique()) > 1})
.to_dict()
)]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Riccardo Bucco |
