Create new boolean fields based on specific bigrams appearing in a tokenized pandas dataframe

I need to loop over a list of bigrams and create a boolean field for each one, indicating whether or not it is present in a tokenized pandas Series.

List of bigrams:

bigrams = ['data science', 'computer science', 'bachelors degree']

Dataframe:

import pandas as pd

df = pd.DataFrame(data={'job_description': [['data', 'science', 'degree', 'expert'],
                                            ['computer', 'science', 'degree', 'masters'],
                                            ['bachelors', 'degree', 'computer', 'vision'],
                                            ['data', 'processing', 'science']]})

Desired Output:

                         job_description  data science  computer science  bachelors degree
0        [data, science, degree, expert]          True             False             False
1   [computer, science, degree, masters]         False              True             False
2  [bachelors, degree, computer, vision]         False             False              True
3            [data, processing, science]         False             False             False

Criteria:

  1. Only exact matches should be flagged (for example, flagging for 'data science' should return True for 'data science' but False for 'science data' or 'data bachelors science')
  2. Each search term should get its own field and be concatenated to the original df

What I've tried:

Failed: df = [x for x in df['job_description'] if x in bigrams]

Failed: df[bigrams] = [[any(w==term for w in lst) for term in bigrams] for lst in df['job_description']]

Failed: Could not adapt the approach here -> Match trigrams, bigrams, and unigrams to a text; if unigram or bigram a substring of already matched trigram, pass; python

Failed: Could not get this one to adapt, either -> Compare two bigrams lists and return the matching bigram

Failed: This method is very close, but couldn't adapt it to bigrams -> Create new boolean fields based on specific terms appearing in a tokenized pandas dataframe

Thanks for any help you can provide!



Solution 1:[1]

You could use a regex and extractall:

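# one capture group per bigram; the space in each bigram matches any whitespace run
regex = '|'.join('(%s)' % b.replace(' ', r'\s+') for b in bigrams)
# join each token list into one string, extract every match,
# then collapse to a single boolean row per original index
matches = (df['job_description'].apply(' '.join)
           .str.extractall(regex).droplevel(1).notna()
           .groupby(level=0).max()
           )
matches.columns = bigrams

# rows with no match at all are missing from `matches`, so fill them with False
out = df.join(matches).fillna(False)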

output:

                         job_description  data science  computer science  bachelors degree
0        [data, science, degree, expert]          True             False             False
1   [computer, science, degree, masters]         False              True             False
2  [bachelors, degree, computer, vision]         False             False              True
3            [data, processing, science]         False             False             False

generated regex:

'(data\\s+science)|(computer\\s+science)|(bachelors\\s+degree)'
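
A note on the regex above: it has no word boundaries, so a token such as 'metadata' followed by 'science' would also flag 'data science', and because the alternation consumes text as it matches, two search bigrams that overlap in the same row (say 'computer science' and 'science degree') cannot both be found. If that matters for your data, one possible alternative (a sketch, not part of the original answer) is to skip the string join and compare adjacent token pairs directly. The helper name adjacent_pairs is made up for this example:

```python
import pandas as pd

bigrams = ['data science', 'computer science', 'bachelors degree']

df = pd.DataFrame(data={'job_description': [
    ['data', 'science', 'degree', 'expert'],
    ['computer', 'science', 'degree', 'masters'],
    ['bachelors', 'degree', 'computer', 'vision'],
    ['data', 'processing', 'science'],
]})


def adjacent_pairs(tokens):
    """Return the set of adjacent (token, next_token) pairs in a token list."""
    # hypothetical helper for this sketch; zip stops at the shorter input,
    # so a list with fewer than two tokens yields an empty set
    return set(zip(tokens, tokens[1:]))


# build the pair sets once per row, then test each search bigram against them
pairs = df['job_description'].apply(adjacent_pairs)

out = df.copy()
for b in bigrams:
    target = tuple(b.split())  # 'data science' -> ('data', 'science')
    out[b] = pairs.apply(lambda s, t=target: t in s)

print(out)
```

Since whole tokens are compared, 'metadata science' no longer counts as 'data science', and overlapping bigrams are each checked independently. This assumes every search term is exactly two whitespace-separated words, as in the question.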

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1