Vectorizing lookups and updates on a pandas df
I have a df and its corresponding dictionary, which I created from a groupby and keyed on a shared column. The df has a lot of columns, but here is the important piece.
df:
key change_x x
0 2012_1_23_1 0 1
...
22 2012_1_23_1 0 1
23 2012_1_23_0 1 0
24 2012_1_23_0 0 0
...
46 2012_1_23_0 0 0
47 2012_1_23_1 1 0
48 2012_1_23_1 0 0
...
70 2012_1_23_1 0 0
71 2012_1_23_1 1 0
dict_df:
key x
0 2012_1_23_0 1
1 2012_1_23_1 0
I converted the dict dataframe into a list of dicts using dict_df.to_dict('records'):
dict:
[{'key': '2012_1_23_0', 'x': 1},
{'key': '2012_1_23_1', 'x': 0}]
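For reference, a minimal sketch of that conversion, rebuilding the lookup frame from the values shown above (the post calls the resulting list dict):
import pandas as pd

dict_df = pd.DataFrame({'key': ['2012_1_23_0', '2012_1_23_1'], 'x': [1, 0]})

#to_dict('records') returns one dict per row
records = dict_df.to_dict('records')
#-> [{'key': '2012_1_23_0', 'x': 1}, {'key': '2012_1_23_1', 'x': 0}]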
Both the dictionary and the df share this key in 'key'. I've written a loop that checks the change_x variable: if it is 1, it increments x, and if it is 0, it sets x to the dictionary's current value of x for that key. It takes 2.5s for 20k rows, though, and on my larger df with 400k rows it takes over 3 minutes. Assuming the same data, the df and dictionary after the loop runs are shown below the loop code.
loop code:
def search_key_in_dicts(key, dict):
    # linear scan through the list of records for a matching key
    for d in dict:
        if d['key'] == key:
            return d
    return None

def update_value_in_dicts(key, dict, col, value):
    dict_key = search_key_in_dicts(key, dict)
    dict_key.update({col: value})

def increment_x_value(key, dict):
    update_value_in_dicts(key, dict, 'x', search_key_in_dicts(key, dict).get('x') + 1)
    return search_key_in_dicts(key, dict).get('x')

for i in range(len(data)):
    row = data.iloc[i]
    if row.change_x == 1:
        increment_x_value(row.key, dict)
    data.at[row.name, 'x'] = search_key_in_dicts(row.key, dict).get('x')
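As an aside (not part of the original post): every call to search_key_in_dicts scans the whole record list, so the loop costs O(rows × records). Even before vectorizing, indexing the records by key once cuts each lookup to O(1); a minimal sketch against the same variables:
#index the records by key once (dict is the post's list of records)
x_by_key = {d['key']: d['x'] for d in dict}

for i in range(len(data)):
    row = data.iloc[i]
    if row.change_x == 1:
        x_by_key[row.key] += 1   #O(1) increment instead of a list scan
    data.at[row.name, 'x'] = x_by_key[row.key]
#note: the running values now live in x_by_key, not in the original record list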
df:
key change_x x
0 2012_1_23_1 0 1
...
22 2012_1_23_1 0 1
23 2012_1_23_0 1 1
24 2012_1_23_0 0 1
...
46 2012_1_23_0 0 1
47 2012_1_23_1 1 2
48 2012_1_23_1 0 2
...
70 2012_1_23_1 0 2
71 2012_1_23_1 1 3
dictionary:
key x
0 2012_1_23_0 3
1 2012_1_23_1 1
I know the loop works, and I guess I could live with a 3-minute runtime if I do have to run this again, but I was trying to make it faster using np.where or pd.apply, and neither worked. This is what I tried with np.where:
np.where(df['change_x'] == 1, increment_x_value(df['key'], dict), search_key_in_dicts(df['key'], dict))
but I got this error: ValueError: The truth value of a Series is ambiguous. - my best guess is that the helper functions receive the whole df['key'] Series at once, so the comparison d['key'] == key inside search_key_in_dicts produces a boolean Series that the if statement can't reduce to a single True/False.
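That error is easy to reproduce in isolation; a standalone sketch (not from the post) of a boolean Series hitting an if statement:
import pandas as pd

keys = pd.Series(['2012_1_23_0', '2012_1_23_1'])
mask = keys == '2012_1_23_0'   #elementwise comparison -> boolean Series

if mask:   #raises ValueError: The truth value of a Series is ambiguous
    print('matched')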
Same for this apply function:
def change_x_apply(key, change_x):
    if change_x == 1:
        increment_x_value(key, dict)
    return search_key_in_dicts(key, dict).get('x')

df.apply(lambda x: change_x_apply(key=df['key'], change_x=df['change_x']), axis=1)
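A version that at least runs (though still slow, since every row repeats the same linear scans) would pass the row's own scalars rather than whole columns; a sketch assuming the helpers above:
#pass the current row's scalar values, not entire columns
df['x'] = df.apply(lambda r: change_x_apply(key=r['key'], change_x=r['change_x']),
                   axis=1)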
Kind of at a loss as to what I can do to get this runtime down. Any suggestions?
Solution 1:[1]
You need to rewrite your code completely:
#create default index
df = df.reset_index(drop=True)

#counter column per key
df['g'] = df.groupby('key').cumcount()

#merge df1 (used to generate the dict) on the first match per key; if no match, set 0
df['new'] = (df.merge(df1.assign(g=0)
                        .rename(columns={'x':'new'}), on=['key', 'g'], how='left')['new']
               .fillna(0)
               .astype(int))

#add both columns and take the cumulative sum per key group
df['x'] = (df['new'] + df['change_x']).groupby(df['key']).cumsum()

#drop helper columns (commented out here for debugging)
#df = df.drop(['g','new'], axis=1)

print(df)
key change_x x g new
0 2012_1_23_1 0 0 0 0
1 2012_1_23_1 0 0 1 0
2 2012_1_23_0 1 2 0 1
3 2012_1_23_0 0 2 1 0
4 2012_1_23_0 0 2 2 0
5 2012_1_23_1 1 1 2 0
6 2012_1_23_1 0 1 3 0
7 2012_1_23_1 0 1 4 0
8 2012_1_23_1 1 2 5 0
The dictionary dataframe is then regenerated from the last row of each key group:
df1 = df.drop_duplicates('key', keep='last')[['key','x']]
print(df1)
key x
4 2012_1_23_0 2
8 2012_1_23_1 2
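Put together, a self-contained sketch of the whole pipeline, assuming the 9-row sample behind the output above (the original x column can be omitted, since it is overwritten):
import pandas as pd

#sample data matching the printed output above
df = pd.DataFrame({'key': ['2012_1_23_1', '2012_1_23_1', '2012_1_23_0',
                           '2012_1_23_0', '2012_1_23_0', '2012_1_23_1',
                           '2012_1_23_1', '2012_1_23_1', '2012_1_23_1'],
                   'change_x': [0, 0, 1, 0, 0, 1, 0, 0, 1]})
df1 = pd.DataFrame({'key': ['2012_1_23_0', '2012_1_23_1'], 'x': [1, 0]})

df = df.reset_index(drop=True)
df['g'] = df.groupby('key').cumcount()
df['new'] = (df.merge(df1.assign(g=0)
                        .rename(columns={'x':'new'}), on=['key', 'g'], how='left')['new']
               .fillna(0)
               .astype(int))
df['x'] = (df['new'] + df['change_x']).groupby(df['key']).cumsum()
df = df.drop(['g','new'], axis=1)

#regenerate the lookup frame from the last row per key
df1 = df.drop_duplicates('key', keep='last')[['key','x']]
print(df)
print(df1)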
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
