'Efficient way to manipulate values of a pandas dataframe
I am dealing with 2 huge dataframes and I need to perform a specific operation to retrieve the most frequent value of one of the two dataframe for each unique id in the first one. I will explain it better with an example.
Suppose I have to dataframes, the first one will be called df_id, the second one df_values
df_values:
ids values
0 id1 10
1 id2 20
2 id1 10
3 id1 30
4 id2 40
...
df_id:
ids desc
0 id1 a product
1 id2 a product
2 id3 a product
In the real dataframes I have additional columns, but for the sake of clarity I omitted them.
Now, df_id contains references of all the ids I need, while df_values contains the (multiple) values associated to each id.
My scope is to create a dictionary that, for each distinct id, report the most frequent value. In case it does not appear in df_values, then I write a None instead of a value.
values_dict
{'id1': 10, 'id2': 20, 'id3': None}
My attempt to solve this was through something like that:
from collections import Counter
import numpy as np
def Most_Common(lst):
data = Counter(lst)
return data.most_common(1)[0][0]
dict_val = {}
for ar in pd.unique(df_id['ids']):
df_art = df_values.loc[df_values['ids'] == ar]
print("Done", ar)
val = Most_Common(df_art['values']) if not df_art.empty else None
dict_val[ar] = val
Code of Most_Common function from this answer
However, this solution seems to not escalate well if my dataframes are VERY large, which is my case. I even tried to use multiprocessing library:
from collections import Counter
import multiprocessing as mp
import numpy as np
dict_val = {}
unique_ids = pd.unique(df_id["ids"])
unique_ids = np.array_split(unique_ids, 5)
def register_value(ids):
for ar in ids:
df_art = df_values.loc[df_values['ids'] == ar]
print("Done", ar)
val = Most_Common(df_art['values']) if not df_art.empty else None
dict_val[ar] = val
with mp.Pool(processes = 5) as p:
p.map(register_value, unique_ids)
But the code is still running for about 40 minutes... maybe I have done something wrong. Is there an efficient solution, which may be also applied in case of multiprocessing? Thank you.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
