pandas rolling window aggregating string column
I am struggling to compute a string aggregation over a rolling window in pandas.
I am given the following df, where t_dat is the purchase date and customer_id and article_id are self-explanatory.
| t_dat | customer_id | article_id |
|---|---|---|
| 2020-04-24 | 486230 | 781570001 |
| 2020-04-24 | 486230 | 598755030 |
| 2020-04-27 | 486230 | 836997001 |
| 2020-05-02 | 486230 | 687707005 |
| 2020-06-03 | 486230 | 741356002 |
and I'd like to group by customer_id and concatenate article ids over a weekly rolling window (see the article_ids column in the table below). pandas doesn't seem to support rolling-window aggregation on string columns, so I also tried resample, but it doesn't accomplish what I expect (see the table below for my expected result):
| t_dat | customer_id | article_id | article_ids |
|---|---|---|---|
| 2020-04-24 | 486230 | 781570001 | 598755030 836997001 |
| 2020-04-24 | 486230 | 598755030 | 781570001 836997001 |
| 2020-04-27 | 486230 | 836997001 | 836997001 687707005 |
| 2020-05-02 | 486230 | 687707005 | 687707005 |
| 2020-06-03 | 486230 | 741356002 | 741356002 |
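For reference, the article_ids column I am after can be expressed as a self-merge per customer restricted to a forward 7-day window. This is only a sketch of the target (note it excludes the article pairing with itself, so a purchase with no companion within 7 days produces no row):

```python
import pandas as pd

df = pd.DataFrame({
    't_dat': pd.to_datetime(['2020-04-24', '2020-04-24', '2020-04-27',
                             '2020-05-02', '2020-06-03']),
    'customer_id': [486230] * 5,
    'article_id': [781570001, 598755030, 836997001, 687707005, 741356002],
})

# pair each purchase with every purchase by the same customer in the
# following 7 days, dropping the pairing of an article with itself
pairs = df.merge(df, on='customer_id', suffixes=('', '_other'))
pairs = pairs[(pairs['t_dat_other'] >= pairs['t_dat'])
              & (pairs['t_dat_other'] <= pairs['t_dat'] + pd.Timedelta(days=7))
              & (pairs['article_id'] != pairs['article_id_other'])]

# concatenate companion ids back into one string per original purchase
out = (pairs.groupby(['t_dat', 'customer_id', 'article_id'])['article_id_other']
            .agg(lambda s: ' '.join(s.astype(str)))
            .rename('article_ids')
            .reset_index())
```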
My goal is to understand whether there are purchase patterns among different article_ids (i.e. are some articles bought shortly after a client has purchased another article?)
To make it more explicit, I am trying to structure the problem in two steps:
- What articles has a customer purchased within 7 days of any other article purchase? I want to repeat this exercise for each customer and each purchased product.
- Once this is done, I want to identify the articles that are most frequently purchased in combination (within one week) with other products, so I can build a basic recommender system.
Here I am looking for a solution to number 1.
I have tried both
df.groupby('customer_id').rolling('7D', on = 't_dat', min_periods = 1)['article_id'].agg(' '.join).reset_index()
or
df.groupby('customer_id').rolling('7D', on = 't_dat', min_periods = 1)['article_id'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
and, using resample,
df.groupby('customer_id').resample('7D', on = 't_dat')['article_id'].agg(' '.join).reset_index()
without success. The first one fails with TypeError: sequence item 0: expected str instance, float found, and after casting article_id to string it raises TypeError: must be real number, not str. The resample attempt doesn't apply the proper offset: it builds fixed weekly intervals starting from the first occurrence in the dataset instead of a rolling window per row.
I have coded an alternative, but it is extremely slow, and I would like to leverage pandas vectorized operations to speed it up:
from collections import Counter
from datetime import timedelta

# for each article_id in every purchase, check which other articles
# were bought within the following week
articles_list = df.groupby(['customer_id', 't_dat'])['article_id'].apply(list).reset_index()

def get_recommendations():
    dict_recs = {}
    # iterate over the per-day lists (note: iterating over articles_list,
    # not df, so row['article_id'] is a list of that day's purchases)
    for n, row in articles_list.iterrows():
        customer = row['customer_id']
        date_purchase = row['t_dat']
        articles_purchase = row['article_id']
        df_clean = articles_list[(articles_list['customer_id'] == customer)
                                 & (articles_list['t_dat'] <= date_purchase + timedelta(days=7))
                                 & (articles_list['t_dat'] >= date_purchase)]
        articles_to_recommend = df_clean['article_id']
        for article in articles_purchase:
            # flatten the per-day lists, excluding the article itself
            candidates = [a for day in articles_to_recommend for a in day if a != article]
            if article not in dict_recs:
                dict_recs[article] = candidates
            else:
                dict_recs[article].extend(candidates)
    recs_list = {k: Counter(v).most_common(12) for k, v in dict_recs.items()}
    return recs_list
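As an aside, the counting in step 2 can usually be vectorized the same way: a self-merge restricted to a forward 7-day window, then a grouped value_counts instead of Counter.most_common. A sketch on the sample data (the pairing rule is my assumption of what should be counted):

```python
from datetime import timedelta
import pandas as pd

df = pd.DataFrame({
    't_dat': pd.to_datetime(['2020-04-24', '2020-04-24', '2020-04-27',
                             '2020-05-02', '2020-06-03']),
    'customer_id': [486230] * 5,
    'article_id': [781570001, 598755030, 836997001, 687707005, 741356002],
})

# every (article, co-purchased article) pair within a forward 7-day window
pairs = df.merge(df, on='customer_id', suffixes=('', '_rec'))
pairs = pairs[(pairs['t_dat_rec'] >= pairs['t_dat'])
              & (pairs['t_dat_rec'] <= pairs['t_dat'] + timedelta(days=7))
              & (pairs['article_id'] != pairs['article_id_rec'])]

# top 12 co-purchased articles per article, most frequent first
recs = (pairs.groupby('article_id')['article_id_rec']
             .value_counts()
             .groupby(level=0)
             .head(12))
```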
Can you suggest any alternative I can use to accomplish what I am looking for?
Solution 1:[1]
I was able to aggregate by the day: create a second dataframe that accumulates all of each customer's articles per day, then use pd.Grouper to bucket them into 7-day windows.
import pandas as pd
from io import StringIO

data="""
t_dat customer_id article_id
2020-04-24 486230 781570001
2020-04-24 486230 598755030
2020-04-27 486230 836997001
2020-05-02 486230 687707005
2020-06-03 486230 741356002
"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
df['t_dat'] = pd.to_datetime(df['t_dat'])
df = df.sort_values(by=['t_dat'])
df = df.set_index('t_dat')
print(df)
df = df.groupby(['customer_id', pd.Grouper(level='t_dat', freq='7D')])['article_id'].apply(list).reset_index()
print(df)
output:
customer_id t_dat article_id
0 486230 2020-04-24 [781570001, 598755030, 836997001]
1 486230 2020-05-01 [687707005]
2 486230 2020-05-29 [741356002]
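One caveat: pd.Grouper(freq='7D') produces fixed, non-overlapping 7-day bins anchored at the first date in the data, not a per-row rolling window, so two purchases only 5 days apart can still land in different bins:

```python
import pandas as pd

df = pd.DataFrame({
    't_dat': pd.to_datetime(['2020-04-24', '2020-04-24', '2020-04-27',
                             '2020-05-02', '2020-06-03']),
    'customer_id': [486230] * 5,
    'article_id': [781570001, 598755030, 836997001, 687707005, 741356002],
}).set_index('t_dat')

binned = (df.groupby(['customer_id', pd.Grouper(level='t_dat', freq='7D')])['article_id']
            .apply(list)
            .reset_index())

# 2020-05-02 is only 5 days after 2020-04-27, yet the two purchases fall
# into different bins (the 2020-04-24 bin vs the 2020-05-01 bin)
```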
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
