'Create new variable using vectors of upper and lower bounds
I have a DataFrame like:
df = pd.DataFrame([1,2,1,3,4,5], columns=['data'])
and a set of categories and bounds:
catBounds = pd.DataFrame([['a',1,2],
['b',3,4]],
columns=['cat','lb','ub'])
that I would like to add a column of categories to the first DataFrame where if the value lies between bounds 1 and 2, the new column is assigned "a" and if the value lies between the bounds 3 and 4, the new column is assigned "b". An inelegant way of doing this would be to loop over the rows of catBounds and making assignments:
for index, row in catBounds.iterrows():
df.loc[(df.data>=row.lb) & (df.data<=row.ub), 'cat'] = row['cat']
getting:
| data | cat | |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | a |
| 2 | 1 | a |
| 3 | 3 | b |
| 4 | 4 | b |
| 5 | 5 | NaN |
Is there a cleaner, more Pythonic way of doing this?
Solution 1:[1]
One option is to melt catBounds to create a mapping from category to bounds, then use map:
df['cat'] = df['data'].map(catBounds.melt('cat').set_index('value')['cat'])
A more general approach could be to use numpy.select:
import numpy as np
df['cat'] = np.select([df['data'].between(lb, ub) for lb,ub in zip(catBounds['lb'], catBounds['ub'])], catBounds['cat'], np.nan)
Output:
data cat
0 1 a
1 2 a
2 1 a
3 3 b
4 4 b
5 5 NaN
Solution 2:[2]
You can melt the data and merge:
df.merge(catBounds
.melt(id_vars='cat', value_name='data')
.drop(columns='variable'),
on='data', how='left')
Output:
data cat
0 1 a
1 2 a
2 1 a
3 3 b
4 4 b
5 5 NaN
Solution 3:[3]
This looks like a non-equi join - specifically a range join; one option is the conditional_join from pyjanitor to get the relevant data:
# pip install pyjanitor
# you can install the dev version as well
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df
.conditional_join(
catBounds,
# col from left, col from right, comparator
('data', 'lb', '>='),
('data', 'ub', '<='),
how = 'left')
.loc[:, ['data', 'cat']]
)
data cat
0 1 a
1 2 a
2 1 a
3 3 b
4 4 b
5 5 NaN
If your bounds are exact matches, there is no point using this, a merge or map will be more efficient
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | mozway |
| Solution 3 |
