'Create new variable using vectors of upper and lower bounds

I have a DataFrame like:

df = pd.DataFrame([1,2,1,3,4,5], columns=['data'])

and a set of categories and bounds:

catBounds = pd.DataFrame([['a',1,2],
                          ['b',3,4]],
                         columns=['cat','lb','ub'])

that I would like to add a column of categories to the first DataFrame where if the value lies between bounds 1 and 2, the new column is assigned "a" and if the value lies between the bounds 3 and 4, the new column is assigned "b". An inelegant way of doing this would be to loop over the rows of catBounds and making assignments:

for index, row in catBounds.iterrows():
    df.loc[(df.data>=row.lb) & (df.data<=row.ub), 'cat'] = row['cat']

getting:

data cat
0 1 a
1 2 a
2 1 a
3 3 b
4 4 b
5 5 NaN

Is there a cleaner, more Pythonic way of doing this?



Solution 1:[1]

One option is to melt catBounds to create a mapping from category to bounds, then use map:

df['cat'] = df['data'].map(catBounds.melt('cat').set_index('value')['cat'])

A more general approach could be to use numpy.select:

import numpy as np
df['cat'] = np.select([df['data'].between(lb, ub) for lb,ub in zip(catBounds['lb'], catBounds['ub'])], catBounds['cat'], np.nan)

Output:

   data  cat
0     1    a
1     2    a
2     1    a
3     3    b
4     4    b
5     5  NaN

Solution 2:[2]

You can melt the data and merge:

df.merge(catBounds
         .melt(id_vars='cat', value_name='data')
         .drop(columns='variable'),
         on='data', how='left')

Output:

   data  cat
0     1    a
1     2    a
2     1    a
3     3    b
4     4    b
5     5  NaN

Solution 3:[3]

This looks like a non-equi join - specifically a range join; one option is the conditional_join from pyjanitor to get the relevant data:

# pip install pyjanitor
# you can install the dev version as well
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd

(df
.conditional_join(
    catBounds, 
    # col from left, col from right, comparator
    ('data', 'lb', '>='), 
    ('data', 'ub', '<='), 
    how = 'left')
.loc[:, ['data', 'cat']]
)

   data  cat
0     1    a
1     2    a
2     1    a
3     3    b
4     4    b
5     5  NaN

If your bounds are exact matches, there is no point using this, a merge or map will be more efficient

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 mozway
Solution 3