Deduplicating arrays in columns with mixed data types (Python)
I have a dataframe whose columns hold mixed data types: strings, arrays, and ints. All the arrays are dtype=object.
>>> import numpy as np
>>> import pandas as pd
>>> test = pd.DataFrame({'id': ['a', 'b', 'c', 'd'],
...                      'state': ['Arizona', np.array(['Texas', 'Texas', 'Texas']), 'Texas', np.array(['Texas', 'California'])],
...                      'zip': [91239, 21939, np.array([12941, 13511]), np.array([11111, 11111, 11111])]})
>>> test
  id                  state                    zip
0  a                Arizona                  91239
1  b  [Texas, Texas, Texas]                  21939
2  c                  Texas         [12941, 13511]
3  d    [Texas, California]  [11111, 11111, 11111]
My desired output deduplicates arrays wherever they exist and, when an array contains more than one distinct item, replaces it with the string 'Multiple':
desired_output
  id     state       zip
0  a   Arizona     91239
1  b     Texas     21939
2  c     Texas  Multiple
3  d  Multiple     11111
I've tried some convoluted logic where I first create temporary columns that count the number of unique items in each cell, or that check whether all() items in an array match the first item, but these attempts all break. Thanks for any help!
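One possible approach, offered as a sketch rather than a definitive answer, is to skip the temporary columns and apply a small helper to every cell: scalars pass through unchanged, and arrays collapse to their single distinct item or to 'Multiple'. The helper name `collapse` and the `result` variable are hypothetical, and the sketch assumes that array items are hashable and that no array is empty.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'state': ['Arizona', np.array(['Texas', 'Texas', 'Texas']),
              'Texas', np.array(['Texas', 'California'])],
    'zip': [91239, 21939, np.array([12941, 13511]),
            np.array([11111, 11111, 11111])],
})


def collapse(value):
    """Hypothetical helper: reduce an array cell to one item or 'Multiple'."""
    if isinstance(value, np.ndarray):
        # dict.fromkeys drops duplicates while keeping first-seen order;
        # tolist() turns NumPy scalars into plain Python objects.
        distinct = list(dict.fromkeys(value.ravel().tolist()))
        # Assumption: arrays are non-empty, so zero distinct items is not handled.
        return distinct[0] if len(distinct) == 1 else 'Multiple'
    # Strings, ints, and other scalars are returned unchanged.
    return value


result = test.copy()
for col in ['state', 'zip']:
    # Series.map calls collapse once per cell of the object column.
    result[col] = result[col].map(collapse)

print(result)
```

`dict.fromkeys` is used here instead of `np.unique` because `np.unique` sorts its input, which can raise a `TypeError` when an object-dtype array mixes types such as strings and ints.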
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
