'Grouping columns, then deleting values based on null column values
I am reposting since my last question was poorly worded. I have a table that looks like the following:
+------+------+------+------+------------+----------+
| ID 1 | ID 2 | Type | Year | Identified | Multiple |
+------+------+------+------+------------+----------+
| 100 | 10 | A | 2018 | 12 | |
| 100 | 11 | B | 2019 | | multiple |
| 100 | 12 | C | 2019 | | multiple |
| 100 | 13 | D | 2019 | | |
| 200 | 10 | A | 2018 | | |
| 200 | 11 | B | 2019 | | multiple |
| 200 | 12 | C | 2019 | | multiple |
| 200 | 13 | D | 2019 | | |
+------+------+------+------+------------+----------+
I am trying to delete the "multiple" strings inside the "Multiple" column where the ID group does not have a Identified value. For example, the first group of ID 1 == 100 contains a not-null Identified value so we can leave the "multiple" values. However, the ID 1 == 200 group has no Identified values, so I would like to remove the "multiple" values that appear in this group, giving us the following dataframe.
+------+------+------+------+------------+----------+
| ID 1 | ID 2 | Type | Year | Identified | Multiple |
+------+------+------+------+------------+----------+
| 100 | 10 | A | 2018 | 12 | |
| 100 | 11 | B | 2019 | | multiple |
| 100 | 12 | C | 2019 | | multiple |
| 100 | 13 | D | 2019 | | |
| 200 | 10 | A | 2018 | | |
| 200 | 11 | B | 2019 | | |
| 200 | 12 | C | 2019 | | |
| 200 | 13 | D | 2019 | | |
+------+------+------+------+------------+----------+
Please let me know if I can rephrase my question.
EDIT: if both Identified and Multiple columns are blank, then leave blank.
Solution 1:[1]
Assuming your dataframe can be built with the same types than this one (note for next time how you can provide a sample dataframe as one can't directly copy yours)
df= pd.DataFrame({
'ID 1': [100]*4+[200]*4,
'other_cols':range(8),
'Identified':['12']+['']*7,
'Multiple':['','multiple','multiple','']*2
})
print(df)
# ID 1 other_cols Identified Multiple
# 0 100 0 12
# 1 100 1 multiple
# 2 100 2 multiple
# 3 100 3
# 4 200 4
# 5 200 5 multiple
# 6 200 6 multiple
# 7 200 7
So to do the job, check where Identified if not equal (ne) to blank. groupby.transform with any to get True for all the same ID if one is not blank. Use the reverse (~) of this mask to select the IDs with only blank and assign blank in the other column.
# greate a mask with True where at least one non blank value per ID
mask = df['Identified'].ne('').groupby(df['ID 1']).transform(any)
# reverse the mask and assign blank
df.loc[~mask, 'Multiple'] = ''
print(df)
# ID 1 other_cols Identified Multiple
# 0 100 0 12
# 1 100 1 multiple
# 2 100 2 multiple
# 3 100 3
# 4 200 4
# 5 200 5
# 6 200 6
# 7 200 7
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ben.T |
