'Pandas: Detect change in a group and change all other entries
I have a couple of datasets with names and ids. One person can be present in a dataset more then one time then this person has the same id in the dataset. The name of person can differ within the dataset and over all datasets. I need to assign UID to each person across all datasets.
This is a example dataset, where two datasets were merged into one. Ids starting with letter 'a' are from first dataset and ids starting with 'b' from the second one.
df = pd.DataFrame({'name': ['Bob', 'Brb R.','Alice', 'John', 'Jn D.','Suzy', 'bob', 'greg', 'john', 'suzy', 'suz Q.', 'sz Qual'],
'id': ['a1', 'a1', 'a2', 'a3', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4', 'b4', 'b4']})
name id
0 Bob a1
1 Brb R. a1
2 Alice a2
3 John a3
4 Jn D. a3
5 Suzy a4
6 bob b1
7 greg b2
8 john b3
9 suzy b4
10 suz Q. b4
11 sz Qual b4
My idea was to set UID to groups with same id which woud result in something like this:
name id uid
0 Bob a1 u1
1 Brb R. a1 u1
2 Alice a2 u2
3 John a3 u3
4 Jn D. a3 u3
5 Suzy a4 u4
6 bob b1 u5
7 greg b2 u6
8 john b3 u7
9 suzy b4 u8
10 suz Q. b4 u8
11 sz Qual b4 u8
Then I would find a match over lowercased names and assign the same uid over datasets. I would like to know how can I detect the change of UID in one group and change all others in the group accordingly. Let's say I'm gonna find a match on the name 'suzy' and I will assign uid 'u4' to 'suzy' with id 'b4', how could I assign uid 'u4' to all others in the group with id 'b4' (suz Q., sz Qual).
The result should look like this:
name id uid
0 Bob a1 u1
1 Brb R. a1 u1
2 Alice a2 u2
3 John a3 u3
4 Jn D. a3 u3
5 Suzy a4 u4
6 bob b1 u1
7 greg b2 u6
8 john b3 u3
9 suzy b4 u4
10 suz Q. b4 u4
11 sz Qual b4 u4
uniq = df['id'].unique()
df['uid'] = np.nan
for id in uniq:
uid = uuid.uuid4().hex
df.loc[df['id'] == id, "uid"] = uid
df['names lower'] = df['name'].str.lower()
Thank you for any advice.
EDIT: The ids don't need to be sorted across datasets and their values are on the broad scale. Meaning that the id could have a value 'Br5hdj45x9' etc.
Solution 1:[1]
Try:
- Create a series of "uid" based on the "id"
- Create "temp_id" mapping each name (lower case) to an "id"
- Re-map each name to get the same ids for matching id or name and get the "uid".
uid = "u"+df.groupby("id").ngroup().add(1).astype(str)
temp_id = df.groupby(df["name"].str.lower())["id"].transform('first')
df["uid"] = uid.groupby(df["id"].map(temp_id.groupby(df["id"]).first())).transform("first")
>>> df
name id uid
0 Bob a1 u1
1 Brb R. a1 u1
2 Alice a2 u2
3 John a3 u3
4 Jn D. a3 u3
5 Suzy a4 u4
6 bob b1 u1
7 greg b2 u6
8 john b3 u3
9 suzy b4 u4
10 suz Q. b4 u4
11 sz Qual b4 u4
Solution 2:[2]
You can use:
df['uid'] = 'u' + df['id'].ne(df['id'].shift()).cumsum().astype(str)
print(df)
# Output
name id uid
0 Bob a1 u1
1 Brb R. a1 u1
2 Alice a2 u2
3 John a3 u3
4 Jn D. a3 u3
5 Suzy a4 u4
6 bob b1 u5
7 greg b2 u6
8 john b3 u7
9 suzy b4 u8
10 suz Q. b4 u8
11 sz Qual b4 u8
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Corralien |
