pandas rolling window aggregating string column

I am struggling to compute a string aggregation over a rolling window in pandas.

I am working with the following df, where t_dat is the purchase date, and customer_id and article_id are self-explanatory.

t_dat       customer_id  article_id
2020-04-24  486230       781570001
2020-04-24  486230       598755030
2020-04-27  486230       836997001
2020-05-02  486230       687707005
2020-06-03  486230       741356002

and I'd like to group by customer_id and concatenate article_ids over a weekly rolling window (see the article_ids column in the table below). pandas doesn't seem to support rolling-window aggregation on string columns, so I tried resample, but it doesn't accomplish what I expect (see the table below for my expected result):

t_dat       customer_id  article_id  article_ids
2020-04-24  486230       781570001   598755030 836997001
2020-04-24  486230       598755030   781570001 836997001
2020-04-27  486230       836997001   836997001 687707005
2020-05-02  486230       687707005   687707005
2020-06-03  486230       741356002   741356002

My goal is to understand whether there are purchase patterns among different article_ids (i.e. are some articles bought shortly after a client has purchased another article?).

To make it more explicit, I am trying to structure the problem in two steps:

  1. What are the articles that a customer has purchased within 7 days of any other article purchase? I want to repeat this exercise for each customer and each purchased product.
  2. Once this is done, I want to identify the articles that are most frequently purchased in combination (within one week) with other products, so I can build a basic recommender system.

Here I am looking for a solution to number 1.
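For reproducibility, here is a minimal construction of the sample frame from the tables above (my own setup code, not part of the original question):

```python
import pandas as pd

# sample data copied from the table in the question
df = pd.DataFrame({
    't_dat': pd.to_datetime(['2020-04-24', '2020-04-24', '2020-04-27',
                             '2020-05-02', '2020-06-03']),
    'customer_id': [486230] * 5,
    'article_id': [781570001, 598755030, 836997001, 687707005, 741356002],
})
```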

I have tried

df.groupby('customer_id').rolling('7D', on='t_dat', min_periods=1)['article_id'].agg(' '.join).reset_index()

and

df.groupby('customer_id').rolling('7D', on='t_dat', min_periods=1)['article_id'].apply(lambda x: ' '.join(x.astype(str))).reset_index()

and, using resample,

df.groupby('customer_id').resample('7D', on='t_dat')['article_id'].agg(' '.join).reset_index()

without success. The first two fail: agg(' '.join) raises TypeError: sequence item 0: expected str instance, float found, and when I cast article_id to string, the apply version raises TypeError: must be real number, not str (rolling.apply must return a number). The resample attempt doesn't return what I need with the proper offset: it takes week intervals starting from the first occurrence in the dataset and keeps setting fixed weekly intervals, rather than applying a rolling offset per row.
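As a side note (my observation, not in the original question): even on a numeric column, `rolling('7D')` aggregates the trailing window ending at each row, while the expected article_ids column above looks at the *following* 7 days. A tiny demonstration:

```python
import pandas as pd

s = pd.DataFrame({
    't_dat': pd.to_datetime(['2020-04-24', '2020-04-27', '2020-05-02']),
    'n': [1, 1, 1],
})

# each row aggregates the 7 days *ending* at that row, not the week after it
trailing = s.rolling('7D', on='t_dat')['n'].sum().tolist()
# 2020-04-27 sees 2020-04-24, but 2020-04-24 does not see 2020-04-27
```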

I have coded an alternative, but it is extremely slow and I would like to leverage pandas' vectorized operations to speed it up:

# for each article_id in every purchase, check which other articles were bought within the following week

from collections import Counter
from datetime import timedelta

# one row per (customer_id, t_dat) holding the list of articles bought that day
articles_list = df.groupby(['customer_id', 't_dat'])['article_id'].apply(list).reset_index()

def get_recommendations():

    dict_recs = {}

    # iterate over articles_list, where article_id is a list per (customer, day)
    for n, row in articles_list.iterrows():
        customer = row['customer_id']
        date_purchase = row['t_dat']
        articles_purchase = row['article_id']
        df_clean = articles_list[(articles_list['customer_id'] == customer)
                                 & (articles_list['t_dat'] <= date_purchase + timedelta(days=7))
                                 & (articles_list['t_dat'] >= date_purchase)]
        articles_to_recommend = df_clean['article_id']

        print("Iterating over row {}".format(n))

        for article in articles_purchase:
            # flatten the per-day lists, excluding the current article itself
            articles_list_to_iter = [i[j] for i in articles_to_recommend for j in range(len(i)) if i[j] != article]
            if article not in dict_recs:
                dict_recs[article] = articles_list_to_iter
            else:
                dict_recs[article].extend(articles_list_to_iter)

    # keep the 12 most frequent companions per article
    recs_list = {k: Counter(v).most_common(12) for k, v in dict_recs.items()}

    return recs_list
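One vectorized alternative (my own sketch, not from the original thread): a self-merge on customer_id produces every pair of purchases by the same customer, a date mask keeps only pairs within the following 7 days, and two groupbys then yield the per-purchase companion string (step 1) and the most common companions per article (step 2). Note this version excludes the article itself and drops purchases with no companions:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    't_dat': pd.to_datetime(['2020-04-24', '2020-04-24', '2020-04-27',
                             '2020-05-02', '2020-06-03']),
    'customer_id': [486230] * 5,
    'article_id': [781570001, 598755030, 836997001, 687707005, 741356002],
})

# every pair of purchases by the same customer
pairs = df.merge(df, on='customer_id', suffixes=('', '_other'))

# keep only the *other* purchases made within the 7 days following each purchase
mask = (
    (pairs['t_dat_other'] >= pairs['t_dat'])
    & (pairs['t_dat_other'] <= pairs['t_dat'] + pd.Timedelta(days=7))
    & (pairs['article_id'] != pairs['article_id_other'])
)
pairs = pairs[mask]

# step 1: companion articles per purchase, space-joined
article_ids = (
    pairs.groupby(['customer_id', 't_dat', 'article_id'])['article_id_other']
    .agg(lambda s: ' '.join(s.astype(str)))
)

# step 2: most common companions per article (top 12)
recs = (
    pairs.groupby('article_id')['article_id_other']
    .apply(lambda s: s.value_counts().head(12).index.tolist())
    .to_dict()
)
```

The self-merge is quadratic per customer, but stays vectorized; for very large frames it can be chunked by customer batches.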

Can you suggest any alternative I can use to accomplish what I am looking for?



Solution 1

I was able to aggregate by day: create a second dataframe that accumulates, per customer, all articles bought each day, then use pd.Grouper to create the 7-day windows.

import pandas as pd
from io import StringIO

data = """
t_dat   customer_id article_id
2020-04-24  486230  781570001
2020-04-24  486230  598755030
2020-04-27  486230  836997001
2020-05-02  486230  687707005
2020-06-03  486230  741356002
"""

# sep=r'\s+' splits on any whitespace, so the data block works with tabs or spaces
df = pd.read_csv(StringIO(data), sep=r'\s+')
df['t_dat'] = pd.to_datetime(df['t_dat'])
df = df.sort_values(by=['t_dat'])
df = df.set_index('t_dat')
print(df)
df = df.groupby(['customer_id', pd.Grouper(level='t_dat', freq='7D')])['article_id'].apply(list).reset_index()
print(df)

output:

customer_id      t_dat                         article_id
0       486230 2020-04-24  [781570001, 598755030, 836997001]
1       486230 2020-05-01                        [687707005]
2       486230 2020-05-29                        [741356002]
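A caveat worth noting (my reading of the pandas docs, not part of the original answer): `pd.Grouper(freq='7D')` produces fixed, back-to-back 7-day bins anchored at the earliest timestamp, not a per-row rolling window. That is why 2020-05-02 lands in the 2020-05-01 bin above:

```python
import pandas as pd

dates = pd.to_datetime(['2020-04-24', '2020-04-27', '2020-05-02', '2020-06-03'])
s = pd.Series([1, 1, 1, 1], index=dates)

# bins start at the first date and advance in fixed 7-day steps:
# [2020-04-24, 2020-05-01), [2020-05-01, 2020-05-08), ...
bins = s.groupby(pd.Grouper(freq='7D')).sum()
```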

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
