Vectorizing lookups and updates on a pandas df
I have a df and its corresponding dictionary, which I created from a groupby and keyed on a shared column. The df has a lot of columns, but here is the important piece.
df:
key change_x x
0 2012_1_23_1 0 1
...
22 2012_1_23_1 0 1
23 2012_1_23_0 1 0
24 2012_1_23_0 0 0
...
46 2012_1_23_0 0 0
47 2012_1_23_1 1 0
48 2012_1_23_1 0 0
...
70 2012_1_23_1 0 0
71 2012_1_23_1 1 0
dict_df:
key x
0 2012_1_23_0 1
1 2012_1_23_1 0
I converted the dict dataframe into a list of dicts using dict_df.to_dict('records'):
dict:
[{'key': '2012_1_23_0', 'x': 1},
{'key': '2012_1_23_1', 'x': 0}]
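For reference, a minimal sketch of that conversion, rebuilding the lookup frame from the values shown above (the post calls the resulting list dict):
import pandas as pd

dict_df = pd.DataFrame({'key': ['2012_1_23_0', '2012_1_23_1'], 'x': [1, 0]})

#to_dict('records') returns one dict per row
records = dict_df.to_dict('records')
#-> [{'key': '2012_1_23_0', 'x': 1}, {'key': '2012_1_23_1', 'x': 0}]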
Both the dictionary and the df share this key in 'key'. I've written a loop that checks the change_x variable: if it is 1, it increments x, and if it is 0, it sets x to the dictionary's current value of x for that key. It takes 2.5s for 20k rows, though, and on my larger df with 400k rows it takes over 3 minutes. Assuming the same data, the df and dictionary after the loop runs are shown below the loop code.
loop code:
def search_key_in_dicts(key, dict):
    # linear scan through the list of records for a matching key
    for d in dict:
        if d['key'] == key:
            return d
    return None

def update_value_in_dicts(key, dict, col, value):
    dict_key = search_key_in_dicts(key, dict)
    dict_key.update({col: value})

def increment_x_value(key, dict):
    update_value_in_dicts(key, dict, 'x', search_key_in_dicts(key, dict).get('x') + 1)
    return search_key_in_dicts(key, dict).get('x')

for i in range(len(data)):
    row = data.iloc[i]
    if row.change_x == 1:
        increment_x_value(row.key, dict)
    data.at[row.name, 'x'] = search_key_in_dicts(row.key, dict).get('x')
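As an aside (not part of the original post): every call to search_key_in_dicts scans the whole record list, so the loop costs O(rows × records). Even before vectorizing, indexing the records by key once cuts each lookup to O(1); a minimal sketch against the same variables:
#index the records by key once (dict is the post's list of records)
x_by_key = {d['key']: d['x'] for d in dict}

for i in range(len(data)):
    row = data.iloc[i]
    if row.change_x == 1:
        x_by_key[row.key] += 1   #O(1) increment instead of a list scan
    data.at[row.name, 'x'] = x_by_key[row.key]
#note: the running values now live in x_by_key, not in the original record list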
df:
key change_x x
0 2012_1_23_1 0 1
...
22 2012_1_23_1 0 1
23 2012_1_23_0 1 1
24 2012_1_23_0 0 1
...
46 2012_1_23_0 0 1
47 2012_1_23_1 1 2
48 2012_1_23_1 0 2
...
70 2012_1_23_1 0 2
71 2012_1_23_1 1 3
dictionary:
key x
0 2012_1_23_0 3
1 2012_1_23_1 1
I know the loop works, and I guess I could live with a 3-minute runtime if I do have to run this again, but I was trying to make it faster using np.where or pd.apply, and neither worked. This is what I tried with np.where:
np.where(df['change_x'] == 1, increment_x_value(df['key'], dict), search_key_in_dicts(df['key'], dict))
but I got this error: ValueError: The truth value of a Series is ambiguous. - my best guess is that the helper functions receive the whole df['key'] Series at once, so the comparison d['key'] == key inside search_key_in_dicts produces a boolean Series that the if statement can't reduce to a single True/False.
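That error is easy to reproduce in isolation; a standalone sketch (not from the post) of a boolean Series hitting an if statement:
import pandas as pd

keys = pd.Series(['2012_1_23_0', '2012_1_23_1'])
mask = keys == '2012_1_23_0'   #elementwise comparison -> boolean Series

if mask:   #raises ValueError: The truth value of a Series is ambiguous
    print('matched')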
Same for this apply function:
def change_x_apply(key, change_x):
    if change_x == 1:
        increment_x_value(key, dict)
    return search_key_in_dicts(key, dict).get('x')

df.apply(lambda x: change_x_apply(key=df['key'], change_x=df['change_x']), axis=1)
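A version that at least runs (though still slow, since every row repeats the same linear scans) would pass the row's own scalars rather than whole columns; a sketch assuming the helpers above:
#pass the current row's scalar values, not entire columns
df['x'] = df.apply(lambda r: change_x_apply(key=r['key'], change_x=r['change_x']),
                   axis=1)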
Kind of at a loss as to what I can do to get this runtime down. Any suggestions?
Solution 1:[1]
You need to rewrite your code completely:
#create default index
df = df.reset_index(drop=True)

#counter column per key
df['g'] = df.groupby('key').cumcount()

#merge df1 (used to generate the dict) on the first match per key; if no match, set 0
df['new'] = (df.merge(df1.assign(g=0)
                        .rename(columns={'x':'new'}), on=['key', 'g'], how='left')['new']
               .fillna(0)
               .astype(int))

#add both columns and take the cumulative sum per key group
df['x'] = (df['new'] + df['change_x']).groupby(df['key']).cumsum()

#drop helper columns (commented out here for debugging)
#df = df.drop(['g','new'], axis=1)

print(df)
key change_x x g new
0 2012_1_23_1 0 0 0 0
1 2012_1_23_1 0 0 1 0
2 2012_1_23_0 1 2 0 1
3 2012_1_23_0 0 2 1 0
4 2012_1_23_0 0 2 2 0
5 2012_1_23_1 1 1 2 0
6 2012_1_23_1 0 1 3 0
7 2012_1_23_1 0 1 4 0
8 2012_1_23_1 1 2 5 0
The dictionary dataframe is then regenerated from the last row of each key group:
df1 = df.drop_duplicates('key', keep='last')[['key','x']]
print(df1)
key x
4 2012_1_23_0 2
8 2012_1_23_1 2
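Put together, a self-contained sketch of the whole pipeline, assuming the 9-row sample behind the output above (the original x column can be omitted, since it is overwritten):
import pandas as pd

#sample data matching the printed output above
df = pd.DataFrame({'key': ['2012_1_23_1', '2012_1_23_1', '2012_1_23_0',
                           '2012_1_23_0', '2012_1_23_0', '2012_1_23_1',
                           '2012_1_23_1', '2012_1_23_1', '2012_1_23_1'],
                   'change_x': [0, 0, 1, 0, 0, 1, 0, 0, 1]})
df1 = pd.DataFrame({'key': ['2012_1_23_0', '2012_1_23_1'], 'x': [1, 0]})

df = df.reset_index(drop=True)
df['g'] = df.groupby('key').cumcount()
df['new'] = (df.merge(df1.assign(g=0)
                        .rename(columns={'x':'new'}), on=['key', 'g'], how='left')['new']
               .fillna(0)
               .astype(int))
df['x'] = (df['new'] + df['change_x']).groupby(df['key']).cumsum()
df = df.drop(['g','new'], axis=1)

#regenerate the lookup frame from the last row per key
df1 = df.drop_duplicates('key', keep='last')[['key','x']]
print(df)
print(df1)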
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
