Checking if a set of strings exist in a column with a custom function

Fellow contributors,

I would like to check whether a set of specific keywords exists in each group of a grouped pandas DataFrame. The words I want to check for are start, pending and either finished or almost_finished. I would like to define a custom function for this and apply it with pandas groupby, because defining a function that operates on columns is less clear to me than row-wise operations, where every row is addressed with row[colname]. In this example, if the desired sequence of words exists, I would like the last value in the number column for each ID to be copied into a new column; it doesn't matter if the values before it are empty strings. Here is a reproducible example:

import pandas as pd

df = pd.DataFrame({'ID' : [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
                  'number' : ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
                  'status' : ['start', 'pending', 'finished', 'start', 'pending', 'partially_finished', 'start', 'pending']})

In this case the last group, ID == 1300, has no return value. I am asking mainly to learn the best approach for this kind of problem, where you need to check for certain values in a column. Since I am coming from R, I need to familiarize myself with how to do the same thing in Python. I would also appreciate any better solution you may suggest. Thank you very much in advance.

Solution 1:^[1]

You can aggregate with set and use intersection to check.

But first, I would map partially_finished or almost_finished to finished, if these should be treated equally.

df['status'] = df.status.replace('partially_finished|almost_finished', 'finished', regex=True)

Next, aggregate number to its last value and status to a set, then use intersection to check whether all the required values are present in status.

checkcriteria = {'start', 'pending', 'finished'}
df = df.groupby('ID').agg({'number': 'last', 'status': set})
df['check'] = df.status.transform(lambda x: len(x.intersection(checkcriteria)) == 3)

This should give the following result:

     number                      status  check
ID
1100     No  {start, pending, finished}   True
1200     No  {start, pending, finished}   True
1300     No            {start, pending}  False
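
As a variation on the check above, here is a minimal sketch that tests for a subset directly instead of counting the size of the intersection. The names `required`, `status_sets` and `check` are illustrative and not part of the original answer, and the dict-based replace is an assumed alternative to the regex version.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

# Treat the "finished" variants as one status (exact-match dict replace).
status = df['status'].replace({'partially_finished': 'finished',
                               'almost_finished': 'finished'})

required = {'start', 'pending', 'finished'}

# Collect the statuses of each ID into a set, then test whether
# every required word is contained in that set.
status_sets = status.groupby(df['ID']).agg(set)
check = status_sets.apply(required.issubset)
print(check)
```

Using `issubset` avoids hard-coding the number 3, so the check keeps working if the set of required words changes.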

You can either filter by check or mask and remove the value for number.

# This will only return ID == 1100, 1200
df[df.check]

# OR mask to remove the number value for when check == False
df.loc[~df.check, 'number'] = None
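
If an empty string is preferred over None for the non-matching IDs (as the question suggests), `Series.where` is one possible alternative to the `.loc` assignment. This is only a sketch on a small hand-built frame named `agg`, which stands in for the aggregated result above.

```python
import pandas as pd

# Hand-built stand-in for the aggregated frame shown above.
agg = pd.DataFrame(
    {'number': ['No', 'No', 'No'], 'check': [True, True, False]},
    index=pd.Index([1100, 1200, 1300], name='ID'),
)

# Keep 'number' where check is True; all other rows get the fill value ''.
agg['number'] = agg['number'].where(agg['check'], '')
print(agg)
```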

Solution 2:^[2]

df.groupby.apply is probably what you're looking for. You can apply a function to each group and return either a single value, series, or dataframe. The result will be aggregated together.

e.g. the following function

def return_last_num(df):
    # Each keyword must appear at least once in the group's status column.
    # Note: str.contains('finished') already matches 'partially_finished'.
    if df.status.str.contains('start').any() & df.status.str.contains('pending').any() \
        & (df.status.str.contains('finished').any() | df.status.str.contains('partially_finished').any()):
        df['last_number'] = df.number.values[-1]
    else:
        df['last_number'] = str()
    return df

When applied: df.groupby('ID').apply(return_last_num).

Returns:

     ID number              status last_number
0  1100    Yes               start          No
1  1100     No             pending          No
2  1100     No            finished          No
3  1200    Yes               start          No
4  1200     No             pending          No
5  1200     No  partially_finished          No
6  1300    Yes               start            
7  1300     No             pending

Alternatively: returning a single value

def return_last_num(df):
    # Return the group's last 'number' only if all keywords are present.
    if df.status.str.contains('start').any() & df.status.str.contains('pending').any() \
        & (df.status.str.contains('finished').any() | df.status.str.contains('partially_finished').any()):
        return df.number.values[-1]
    else:
        return str()

When applied: df.groupby('ID').apply(return_last_num).

Returns just the last 'number' value per ID if the sequence condition was matched:

ID
1100    No
1200    No
1300
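
The two variants above can also be combined without modifying the group inside the applied function. The following is a rough sketch, assuming exact matches on the status words are wanted rather than substring matches; `has_sequence`, `required` and `finished_like` are hypothetical names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

required = {'start', 'pending'}
finished_like = {'finished', 'partially_finished', 'almost_finished'}

def has_sequence(status):
    """Return True if the group holds start, pending and a finished variant."""
    seen = set(status)
    return required <= seen and bool(finished_like & seen)

grouped = df.groupby('ID')

# One boolean per ID, then mapped back onto the rows via the ID column.
ok_per_id = grouped['status'].apply(has_sequence)
matched = df['ID'].map(ok_per_id).astype(bool)

# transform('last') broadcasts each group's last value to all its rows.
last_number = grouped['number'].transform('last')
df['last_number'] = last_number.where(matched, '')
print(df)
```

The custom function only receives the status column of one group and returns a single boolean, which keeps it easy to test on its own.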

Solution 3:^[3]

I was thinking of a vectorized approach.

First, if finished and almost finished have the same effect, I would "merge" them and map each status to a unique number that is easy to check:

>>> df['status2'] = df['status'].map({'finished':1,'partially_finished':1,'pending':10,'start':100})
>>> df
     ID number              status  status2
0  1100    Yes               start      100
1  1100     No             pending       10
2  1100     No            finished        1
3  1200    Yes               start      100
4  1200     No             pending       10
5  1200     No  partially_finished        1
6  1300    Yes               start      100
7  1300     No             pending       10

That allows me to "extract" the desired status (100+10+1):

>>> idstatus = df.groupby('ID', sort=False)[['status2']].sum() == 111
>>> idstatus
      status2
ID
1100     True
1200     True
1300    False
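
One caveat with the sum check: it relies on each status occurring exactly once per ID. The sketch below uses made-up data (not the question's data) with a repeated pending row to show the effect, and one possible guard that counts each weight only once per ID. The names `weights`, `naive` and `deduped` are illustrative.

```python
import pandas as pd

# Made-up data: ID 1 has 'pending' twice, ID 2 never finishes.
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2],
    'status': ['start', 'pending', 'pending', 'finished', 'start', 'pending'],
})

weights = {'finished': 1, 'partially_finished': 1, 'pending': 10, 'start': 100}
df['status2'] = df['status'].map(weights)

# Without de-duplication ID 1 sums to 121 because 'pending' occurs twice.
naive = df.groupby('ID')['status2'].sum() == 111

# Counting each weight once per ID restores the intended check.
deduped = (df.drop_duplicates(['ID', 'status2'])
             .groupby('ID')['status2'].sum() == 111)
print(naive)
print(deduped)
```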

Next, get the actual number values. If you don't need to match all the words, this line alone would be enough:

>>> valuenumber = df.query('status2==1').set_index('ID')
>>> valuenumber
     number              status  status2
ID
1100     No            finished        1
1200     No  partially_finished        1

And finally merge:

>>> idstatus.merge(valuenumber, left_index=True, right_index=True, how='left')
      status2_x number              status  status2_y
ID
1100       True     No            finished        1.0
1200       True     No  partially_finished        1.0
1300      False    NaN                 NaN        NaN

It is possible to merge fewer columns:

>>> idstatus.merge(valuenumber[['number','status']], left_index=True, right_index=True, how='left')
      status2 number              status
ID
1100     True     No            finished
1200     True     No  partially_finished
1300    False    NaN                 NaN
>>>

Edit: If you only want the output for the finished ones, you can also get it with this one-liner:

>>> valuenumber.loc[idstatus[idstatus.status2].index][['number','status']]
     number              status
ID
1100     No            finished
1200     No  partially_finished
>>>

Edit2: I just benchmarked. Interestingly, the intersect solution is faster, at least for the sample data:

groupby-apply: 2.86ms
aggregate & merge: 3.90ms
vector & loc: 3.96ms
set-intersect: 2.4ms albeit using lambda
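
Timings like these depend on the machine, the pandas version and the data size, so they are worth re-measuring locally. Below is a minimal sketch of how two of the approaches might be timed with the standard-library `timeit` module; the wrapper functions `set_intersect` and `weighted_sum` are assumptions written for this sketch, not the exact code that produced the numbers above.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'ID': [1100, 1100, 1100, 1200, 1200, 1200, 1300, 1300],
    'number': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status': ['start', 'pending', 'finished', 'start', 'pending',
               'partially_finished', 'start', 'pending'],
})

required = {'start', 'pending', 'finished'}
weights = {'finished': 1, 'partially_finished': 1, 'pending': 10, 'start': 100}

def set_intersect(frame):
    # Aggregate status to a set per ID and keep IDs holding all required words.
    out = frame.assign(status=frame['status'].replace('partially_finished', 'finished'))
    out = out.groupby('ID').agg({'number': 'last', 'status': set})
    return out[out['status'].apply(required.issubset)]

def weighted_sum(frame):
    # Map statuses to weights and keep IDs whose weights add up to 111.
    out = frame.assign(status=frame['status'].map(weights))
    out = out.groupby('ID').agg({'number': 'last', 'status': 'sum'})
    return out[out['status'] == 111]

for func in (set_intersect, weighted_sum):
    seconds = timeit.timeit(lambda: func(df), number=100) / 100
    print(f'{func.__name__}: {seconds * 1000:.2f} ms per call')
```

With only eight rows the measurements mostly reflect fixed pandas overhead, so repeating them on a larger frame gives a better picture of how the approaches scale.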

Edit3: I was too dense yesterday. Using Emma's idea of "two operations in one groupby", and assuming that the items in status are ordered:

df2 = df.copy()  # assumption: df2 is a copy of the original example df
df2['status'] = df2.status.map({'finished':1,'partially_finished':1,'pending':10,'start':100})
x = df2.groupby('ID').agg({'number': 'last', 'status': 'sum'})
x[x.status==111]

At 2.1ms, it is slightly faster than versions with apply / lambda

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Emma
Solution 2	quizzical_panini
Solution 3
