pandas: use map() to assign values based on a range looked up from a dictionary
I am trying to assign a 'condition' value to each row of a dataframe. The value depends on two identifiers: the 'filename' must match, and the 'record' must fall within the 'start' to 'end' range given in a lookup table. My closest attempt so far packs the two identifiers into a tuple and uses it, together with the condition, as a dictionary entry, but my syntax might be off (or I simply can't do it this way). I'm open to any other approach (I'm not married to the tuple idea). Thank you, all.
import pandas as pd
# look up table for the 'condition'. Every 'cell' has 1 'baseline', 'washon', 'washoff' each
# or only one 'baseline'
lut_cols = ['filename', 'condition', 'start', 'end']
lut_vals = [['cell1', 'baseline', 1, 24],
            ['cell1', 'washon', 50, 70],
            ['cell1', 'washoff', 100, 120],
            ['cell2', 'baseline', 2, 22],
            ['cell2', 'washon', 50, 70],
            ['cell2', 'washoff', 100, 120],
            ['cell3', 'baseline', 1, 20]]
lut_df = pd.DataFrame(lut_vals, columns=lut_cols)
# the data that needs a 'condition' column mapped based on the 'filename' and 'record' values that
# need to be in the range 'start' to 'end' given in lut_df
data_cols = ['filename', 'record', 'data']
data_vals = [['cell1', 1, 'some_data0'],
             ['cell1', 1, 'some_data1'],
             ['cell1', 1, 'some_data2'],
             ['cell1', 25, 'some_data3'],
             ['cell1', 25, 'some_data4'],
             ['cell1', 101, 'some_data5'],
             ['cell2', 2, 'some_data6'],
             ['cell2', 2, 'some_data7'],
             ['cell2', 50, 'some_data8'],
             ['cell2', 50, 'some_dat9'],
             ['cell2', 80, 'some_dat10']]
df = pd.DataFrame(data_vals, columns=data_cols)
# the 'filename' and 'start' to 'end' together make up the unique identifiers keys (as tuple)
key = ()
lut_dict = {}
for i in range(lut_df.shape[0]):
    key = (lut_df.filename.iloc[i], range(lut_df.start.iloc[i], lut_df.end.iloc[i]))
    lut_dict[key] = lut_df.condition.iloc[i]
# the 'record' column is now the index because I thought it would make life easier...
df.set_index('record', inplace=True)
df['condition'] = df['filename'].map(lut_dict)
print(df)
lut_dict, for reference:
{('cell1', range(1, 24)): 'baseline',
('cell1', range(50, 70)): 'washon',
('cell1', range(100, 120)): 'washoff',
('cell2', range(2, 22)): 'baseline',
('cell2', range(50, 70)): 'washon',
('cell2', range(100, 120)): 'washoff',
('cell3', range(1, 20)): 'baseline'}
Sadly, this is the output:
filename data condition
record
1 cell1 some_data0 NaN
1 cell1 some_data1 NaN
1 cell1 some_data2 NaN
25 cell1 some_data3 NaN
25 cell1 some_data4 NaN
101 cell1 some_data5 NaN
2 cell2 some_data6 NaN
2 cell2 some_data7 NaN
50 cell2 some_data8 NaN
50 cell2 some_dat9 NaN
80 cell2 some_dat10 NaN
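(For context on the all-NaN column: Series.map does exact key lookups, so each plain string in df['filename'] is compared against the (string, range) tuple keys and never matches. A minimal demonstration:)

```python
import pandas as pd

s = pd.Series(['cell1', 'cell2'])
d = {('cell1', range(1, 24)): 'baseline'}

# Tuple keys never equal plain strings: 'cell1' != ('cell1', range(1, 24)),
# so every lookup misses and map() fills in NaN.
print(s.map(d))

# Exact string keys, by contrast, do match.
print(s.map({'cell1': 'baseline'}))
```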
Solution 1
You can use pd.merge_asof, which matches each 'record' to the nearest earlier 'start'. Both DataFrames must be sorted by their merge keys, so we sort each one first.
df = df.sort_values(by='record')
lut_df = lut_df.sort_values(by='start')
out = pd.merge_asof(df, lut_df, left_on='record', right_on='start',
                    suffixes=('', '_'))[['filename', 'data', 'condition']]
Output:
filename data condition
0 cell1 some_data0 baseline
1 cell1 some_data1 baseline
2 cell1 some_data2 baseline
3 cell2 some_data6 baseline
4 cell2 some_data7 baseline
5 cell1 some_data3 baseline
6 cell1 some_data4 baseline
7 cell2 some_data8 washon
8 cell2 some_dat9 washon
9 cell2 some_dat10 washon
10 cell1 some_data5 washoff
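Note that as written, the merge ignores 'filename' and the 'end' bound, so a record like 80 for cell2 still picks up the nearest earlier interval ('washon', which ends at 70). A stricter variant (my own sketch, not part of the answer above) passes by='filename' so matching happens per cell, then blanks out records that fall past the matched interval's 'end':

```python
import pandas as pd

lut_df = pd.DataFrame(
    [['cell1', 'baseline', 1, 24], ['cell1', 'washon', 50, 70],
     ['cell1', 'washoff', 100, 120], ['cell2', 'baseline', 2, 22],
     ['cell2', 'washon', 50, 70], ['cell2', 'washoff', 100, 120],
     ['cell3', 'baseline', 1, 20]],
    columns=['filename', 'condition', 'start', 'end'])
df = pd.DataFrame(
    [['cell1', 1, 'some_data0'], ['cell1', 25, 'some_data3'],
     ['cell1', 101, 'some_data5'], ['cell2', 80, 'some_dat10']],
    columns=['filename', 'record', 'data'])

# merge_asof with by= matches only within the same filename;
# both frames must still be sorted by the on-keys.
out = pd.merge_asof(df.sort_values('record'), lut_df.sort_values('start'),
                    left_on='record', right_on='start', by='filename')

# Blank out rows whose record lies beyond the matched interval's end,
# so e.g. record 80 for cell2 becomes NaN instead of 'washon'.
out.loc[out['record'] > out['end'], 'condition'] = pd.NA
print(out[['filename', 'record', 'data', 'condition']])
```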
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow