'Pandas pick the higher value for each unique id
I have a df of customers
CUST_ID | SEGMENT | AREA
1 | B | CAD
1 | A | RAM
2 | B | CAD
2 | C | RAM
3 | B | RAM
4 | A | RAM
I want to count the unique number of CUST_ID per SEGMENT so I did
df.groupby(['SEGMENT'])['CUST_ID'].nunique()
However if there are same CUST_ID with different SEGMENT types then the number per SEGMENT gets inflated. I want to pick the highest value SEGMENT per CUST_ID and then count. A being the highest and C being the lowest. So the resulting df would look like:
CUST_ID | SEGMENT | AREA
1 | A | RAM
2 | B | CAD
3 | B | RAM
4 | A | RAM
and the count would be
- A - 2
- B - 2
- C - 0
How would I be able to do this?
Solution 1:[1]
You can try groupby CUST_ID column then filter rows by getting the min value of SEGMENT column.
out = (df.groupby(['CUST_ID'])
.apply(lambda g: g[g['SEGMENT'].eq(g['SEGMENT'].min())])
.reset_index(drop=True))
NOTE: Since you want to pick the highest value SEGMENT per CUST_ID and then count, A being the highest and C being the lowest, in ASCII talbe, A is 65, C is 67. When comparing, A actually is smaller than C. That's why use min here.
print(out)
CUST_ID SEGMENT AREA
0 1 A RAM
1 2 B CAD
2 3 B RAM
3 4 A RAM
res = out.value_counts('SEGMENT')
print(res)
A 2
B 2
Name: SEGMENT, dtype: int64
Solution 2:[2]
You can go like this:
(df.sort_values('SEGMENT').drop_duplicates('CUST_ID') # remove duplicates, keep only first 'CUST_ID'
.groupby('SEGMENT')['CUST_ID'].nunique() # or just `.size()` because there are no duplicates
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | wjandrea |
