Checking if a set of strings exists in a column with a custom function
Fellow contributors,
I would like to check whether a set of specific keywords exists in each group of a grouped pandas DataFrame. The words I want to check are start, pending, and either finished or almost_finished. I would like to define a custom function for this and apply it with pandas groupby, because writing a function that operates on whole columns is less clear to me than row-wise operations, where every row is addressed with row[colname].
In this example, if the sequence of desired words exists for an ID, I would like the last value in the number column for that ID to be copied into a new column; it doesn't matter if the values before it are empty strings. Here is a reproducible example:
import pandas as pd
df = pd.DataFrame({'ID' : [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
'number' : ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
'status' : ['start', 'pending', 'finished', 'start', 'pending', 'partially_finished', 'start', 'pending']})
In this case the last group, ID == 1300, has no return value.
Basically, I am asking this question to learn the best approach for this kind of problem, where you need to check for certain values in a column. Since I am coming from R, I need to familiarize myself with how to do the same thing in Python. I would also appreciate any better solution you may suggest.
Thank you very much in advance.
Solution 1:[1]
You can aggregate with set and use intersection to check.
But first, I would map partially_finished or almost_finished to finished, if these should be treated equally.
df['status'] = df.status.replace('partially_finished|almost_finished', 'finished', regex=True)
Next, aggregate number to its last value and status to a set, then use intersection to check whether all required values are present in status.
checkcriteria = {'start', 'pending', 'finished'}
df = df.groupby('ID').agg({'number': 'last', 'status': set})
df['check'] = df.status.transform(lambda x: len(x.intersection(checkcriteria)) == 3)
This should give the following result:
     number                      status  check
ID
1100     No  {start, pending, finished}   True
1200     No  {start, pending, finished}   True
1300     No            {start, pending}  False
You can either filter by check or mask and remove the value for number.
# This will only return ID == 1100, 1200
df[df.check]
# OR mask to remove the number value for when check == False
df.loc[~df.check, 'number'] = None
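As a minimal sketch of the same idea (not the answerer's code), the check can be wrapped in a helper that leaves the original frame untouched. The names `REQUIRED`, `FINISHED_ALIASES`, and `last_number_if_complete` are hypothetical, and the sketch assumes that `partially_finished` and `almost_finished` should both count as `finished`.

```python
import pandas as pd

REQUIRED = {"start", "pending", "finished"}
# Assumption: these statuses are treated as equivalent to "finished".
FINISHED_ALIASES = {"partially_finished", "almost_finished"}


def last_number_if_complete(df: pd.DataFrame) -> pd.DataFrame:
    """Per ID: last `number`, the set of statuses, and a completeness flag."""
    # Fold the aliases into "finished" on a copy, so the caller's frame is not modified.
    status = df["status"].mask(df["status"].isin(FINISHED_ALIASES), "finished")
    out = df.assign(status=status).groupby("ID").agg({"number": "last", "status": set})
    # A group is complete when every required status is in its set.
    out["check"] = out["status"].apply(lambda s: REQUIRED.issubset(s))
    return out


df = pd.DataFrame({
    "ID": [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    "number": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"],
    "status": ["start", "pending", "finished", "start", "pending",
               "partially_finished", "start", "pending"],
})

result = last_number_if_complete(df)
print(result[result["check"]])
```

Using `issubset` states the intent directly and avoids hard-coding the number 3 when the set of required statuses changes.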
Solution 2:[2]
df.groupby.apply is probably what you're looking for. You can apply a function to each group and return either a single value, series, or dataframe. The result will be aggregated together.
e.g. the following function
def return_last_num(df):
    # str.contains does substring matching, so 'finished' also matches 'partially_finished'
    if df.status.str.contains('start').any() & df.status.str.contains('pending').any() \
            & (df.status.str.contains('finished').any() | df.status.str.contains('partially_finished').any()):
        df['last_number'] = df.number.values[-1]
    else:
        df['last_number'] = str()
    return df
When applied: df.groupby('ID').apply(return_last_num).
Returns:
     ID number              status last_number
0  1100    Yes               start          No
1  1100     No             pending          No
2  1100     No            finished          No
3  1200    Yes               start          No
4  1200     No             pending          No
5  1200     No  partially_finished          No
6  1300    Yes               start
7  1300     No             pending
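The same per-row column can also be built without modifying the group inside `apply`. The following is a sketch under assumptions, not part of the original answer: it uses `groupby(...).transform` to broadcast per-group flags back to the rows, and the `FINISHED` list (including `almost_finished`) is an assumption taken from the question text.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    "number": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"],
    "status": ["start", "pending", "finished", "start", "pending",
               "partially_finished", "start", "pending"],
})

# Assumption: all three spellings count as "finished".
FINISHED = ["finished", "partially_finished", "almost_finished"]
ids = df["ID"]

# One boolean per row: True when the row's whole group contains the status.
has_start = df["status"].eq("start").groupby(ids).transform("any")
has_pending = df["status"].eq("pending").groupby(ids).transform("any")
has_finished = df["status"].isin(FINISHED).groupby(ids).transform("any")
complete = has_start & has_pending & has_finished

# Last `number` of each group, repeated on every row of that group.
last_number = df.groupby("ID")["number"].transform("last")

# Keep the value only for complete groups; use an empty string otherwise.
df["last_number"] = last_number.where(complete, "")
print(df)
```

Because `transform` returns a result aligned with the original index, no merge or re-indexing is needed to attach the new column.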
Alternatively: returning a single value
def return_last_num(df):
    # str.contains does substring matching, so 'finished' also matches 'partially_finished'
    if df.status.str.contains('start').any() & df.status.str.contains('pending').any() \
            & (df.status.str.contains('finished').any() | df.status.str.contains('partially_finished').any()):
        return df.number.values[-1]
    else:
        return str()
When applied: df.groupby('ID').apply(return_last_num).
Returns just the last 'number' value per ID if the sequence condition was matched:
ID
1100    No
1200    No
1300
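One caveat: `str.contains` matches substrings (and treats its pattern as a regular expression by default), so `'finished'` would also match a status such as `'unfinished'`. If exact matches are wanted, a variant along these lines may be safer. It is only a sketch: `last_number_exact` and `FINISHED` are hypothetical names, and the list of accepted "finished" spellings is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    "number": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"],
    "status": ["start", "pending", "finished", "start", "pending",
               "partially_finished", "start", "pending"],
})

# Assumption: these spellings all count as "finished".
FINISHED = ["finished", "partially_finished", "almost_finished"]


def last_number_exact(group: pd.DataFrame) -> str:
    """Return the group's last `number` if all required statuses occur, else ''."""
    status = group["status"]
    complete = (
        status.eq("start").any()          # exact comparison, no substring matching
        and status.eq("pending").any()
        and status.isin(FINISHED).any()
    )
    return group["number"].iloc[-1] if complete else ""


# Selecting the columns explicitly keeps the grouping key out of each group.
per_id = df.groupby("ID")[["number", "status"]].apply(last_number_exact)
print(per_id)
```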
Solution 3:[3]
I was thinking of a vectorized approach.
First, if finished and almost finished have the same effect, I would "merge" them and map each status to a unique number that is easy to check:
>>> df['status2'] = df['status'].map({'finished':1,'partially_finished':1,'pending':10,'start':100})
>>> df
     ID number              status  status2
0  1100    Yes               start      100
1  1100     No             pending       10
2  1100     No            finished        1
3  1200    Yes               start      100
4  1200     No             pending       10
5  1200     No  partially_finished        1
6  1300    Yes               start      100
7  1300     No             pending       10
That allows me to "extract" the desired status (100+10+1):
>>> idstatus = df.groupby('ID', sort=False)[['status2']].sum() == 111
>>> idstatus
      status2
ID
1100     True
1200     True
1300    False
And the actual number values (if you don't need to match all words, this line alone would be enough):
valuenumber=df.query('status2==1').set_index('ID')
     number              status  status2
ID
1100     No            finished        1
1200     No  partially_finished        1
And finally merge:
>>> idstatus.merge(valuenumber, left_index=True, right_index=True, how='left')
      status2_x number              status  status2_y
ID
1100       True     No            finished        1.0
1200       True     No  partially_finished        1.0
1300      False    NaN                 NaN        NaN
It is possible to merge fewer columns:
>>> idstatus.merge(valuenumber[['number','status']], left_index=True, right_index=True, how='left')
      status2 number              status
ID
1100     True     No            finished
1200     True     No  partially_finished
1300    False    NaN                 NaN
Edit: If you only want the output for the finished ones, you can also get it with this one-liner:
>>> valuenumber.loc[idstatus[idstatus.status2].index][['number','status']]
     number              status
ID
1100     No            finished
1200     No  partially_finished
Edit2: Just benchmarked. Interestingly, the set-intersection solution is faster, at least for the sample data:
- groupby-apply: 2.86ms
- aggregate & merge: 3.90ms
- vector & loc: 3.96ms
- set-intersect: 2.4ms albeit using lambda
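Timings like these depend on the machine, the pandas version, and the size of the data, so they are best reproduced locally. The following is a rough sketch of how two of the variants could be timed with `timeit`; the function names `set_intersect` and `weighted_sum` are hypothetical, and this is not the harness the answerer used.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "ID": [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    "number": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"],
    "status": ["start", "pending", "finished", "start", "pending",
               "partially_finished", "start", "pending"],
})

REQUIRED = {"start", "pending", "finished"}
WEIGHTS = {"finished": 1, "partially_finished": 1, "pending": 10, "start": 100}


def set_intersect(frame: pd.DataFrame) -> pd.Series:
    """Set-based check: last number for IDs whose status set covers REQUIRED."""
    status = frame["status"].replace(
        "partially_finished|almost_finished", "finished", regex=True
    )
    out = frame.assign(status=status).groupby("ID").agg({"number": "last", "status": set})
    return out.loc[out["status"].apply(lambda s: REQUIRED.issubset(s)), "number"]


def weighted_sum(frame: pd.DataFrame) -> pd.Series:
    """Weighted-sum check: last number for IDs whose weights add up to 111."""
    out = (
        frame.assign(weight=frame["status"].map(WEIGHTS))
        .groupby("ID")
        .agg({"number": "last", "weight": "sum"})
    )
    return out.loc[out["weight"] == 111, "number"]


for fn in (set_intersect, weighted_sum):
    runs = 200
    seconds = timeit.timeit(lambda: fn(df), number=runs) / runs
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

On an eight-row frame the measurements are dominated by fixed pandas overhead, so the ranking may change on larger data.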
Edit3: I was too dense yesterday. Using Emma's idea of "two operations in one groupby", and assuming that the status items are ordered within each ID:
# df2 is not defined in the answer; presumably it is a fresh copy of the original df
df2['status'] = df2.status.map({'finished':1,'partially_finished':1,'pending':10,'start':100})
x = df2.groupby('ID').agg({'number': 'last', 'status': 'sum'})
x[x.status==111]
At 2.1ms, it is slightly faster than the versions with apply / lambda.
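Note that the weighted sum only equals 111 when each of the three statuses occurs exactly once per ID; a repeated `pending` row, for instance, gives 121 and the group is missed. If repeats are possible, counting distinct categories is one alternative. This is a sketch under assumptions: `CATEGORY`, `complete_ids`, and the sample frame `df_dup` are hypothetical and not part of the original answer.

```python
import pandas as pd

WEIGHTS = {"finished": 1, "partially_finished": 1, "pending": 10, "start": 100}
# Assumption: all "finished" spellings collapse into one category.
CATEGORY = {
    "start": "start",
    "pending": "pending",
    "finished": "finished",
    "partially_finished": "finished",
    "almost_finished": "finished",
}


def complete_ids(df: pd.DataFrame) -> pd.Series:
    """Last `number` for every ID in which all three categories occur at least once."""
    # Statuses missing from CATEGORY become NaN, which nunique ignores.
    category = df["status"].map(CATEGORY)
    summary = df.assign(category=category).groupby("ID").agg(
        number=("number", "last"),
        n_categories=("category", "nunique"),
    )
    return summary.loc[summary["n_categories"] == 3, "number"]


# Hypothetical data: ID 1 has "pending" twice, ID 2 never finishes.
df_dup = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2],
    "number": ["Yes", "No", "No", "No", "Yes", "No"],
    "status": ["start", "pending", "pending", "finished", "start", "pending"],
})

weighted = df_dup["status"].map(WEIGHTS).groupby(df_dup["ID"]).sum()
print(weighted)               # ID 1 sums to 121, so a test against 111 would miss it
print(complete_ids(df_dup))   # ID 1 is still reported as complete
```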
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Emma |
| Solution 2 | quizzical_panini |
| Solution 3 |
