'Create pairwise difference of rolling window of two dataframes
I have a time-series dataframe df with zeros and ones and I aim to compare the three most recent days df_recent with the historic time-series of df. In order to do that, I try to build pairwise differences of df_recent and df on each day with a rolling window and sum them up for each ID column.
df:
ID1 ID2 ID3
Date
2022-01-01 0 1 0
2022-01-02 0 1 0
2022-01-03 1 1 1
2022-01-04 0 0 0
2022-01-05 1 1 1
2022-01-06 0 0 1
2022-01-07 1 1 0
2022-01-08 0 0 0
2022-01-09 1 0 1 <-- 1
2022-01-10 0 1 1 <-- 2
2022-01-11 1 1 0 <-- 3
2022-01-12 1 0 0
df_recent:
ID1 ID2 ID3
Date
2022-01-10 0 1 1 <-- 1
2022-01-11 1 1 0 <-- 2
2022-01-12 1 0 0 <-- 3
df_pairwise_difference:
ID1 ID2 ID3
Date
2022-01-03 1 1 2
2022-01-04 1 0 2
2022-01-05 2 2 1
2022-01-06 1 1 3
2022-01-07 2 2 1
2022-01-08 1 1 0
2022-01-09 2 1 2
2022-01-10 1 3 3
2022-01-11 2 2 1 <-- example pairwise difference
2022-01-12 0 0 0
I try to build the difference by comparing pairwise the values (for an interval of three days) from df_recent with df on each day. An example for the pairwiese difference on 2022-01-11: 2 2 1:
- ID1: 2 = 1 (pairwise difference marked line 1) + 1 (pairwise difference marked line 2) + 0 (pairwise difference marked line 3)
- ID2: 2 = 1 (pairwise difference marked line 1) + 0 (pairwise difference marked line 2) + 1 (pairwise difference marked line 3)
- ID3: 1 = 0 (pairwise difference marked line 1) + 1 (pairwise difference marked line 2) + 0 (pairwise difference marked line 3)
I thought about a for loop, but wasn't able to come up with something useful.
Many thanks for your suggestion!
For reproducability:
import pandas as pd
df = pd.DataFrame({
'Date':['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12'],
'ID1':[0,0,1,0,1,0,1,0,1,0,1,1],
'ID2':[1,1,1,0,1,0,1,0,0,1,1,0],
'ID3':[0,0,1,0,1,1,0,0,1,1,0,0]})
df = df.set_index('Date')
df_recent = df.iloc[-3:]
df_difference = pd.DataFrame({
'Date':['2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12'],
'ID1':[1,1,2,1,2,1,2,1,2,0],
'ID2':[1,0,2,1,2,1,1,3,2,0],
'ID3':[2,2,1,3,1,0,2,3,1,0]})
df_difference = df_difference.set_index('Date')
Solution 1:[1]
If you can install numba, here is one solution with .rolling using the 'table' method that allows you to roll over all columns at once:
import numpy as np
recent_array = df_recent.to_numpy()
def pairwise(arr):
if len(arr) < len(recent_array):
return np.array([np.nan, np.nan, np.nan])
return np.sum(np.abs(arr - recent_array), axis=0)
df_pairwise_difference = df.rolling(len(df_recent), method='table').apply(pairwise, engine='numba', raw=True)
print(df_pairwise_difference.dropna())
ID1 ID2 ID3
Date
2022-01-03 1.0 1.0 2.0
2022-01-04 1.0 0.0 2.0
2022-01-05 2.0 2.0 1.0
2022-01-06 1.0 1.0 3.0
2022-01-07 2.0 2.0 1.0
2022-01-08 1.0 1.0 0.0
2022-01-09 2.0 1.0 2.0
2022-01-10 1.0 3.0 3.0
2022-01-11 2.0 2.0 1.0
2022-01-12 0.0 0.0 0.0
Alternatively, you can iterate the windows:
import numpy as np
recent_array = df_recent.to_numpy()
df_pairwise_difference = df.copy()
for w in df.rolling(len(recent_array)):
if len(w) < len(recent_array):
df_pairwise_difference.loc[w.index[-1], w.columns] = np.nan
else:
df_pairwise_difference.loc[w.index[-1], w.columns] = np.sum(np.abs(w.to_numpy() - recent_array), axis=0)
print(df_pairwise_difference.dropna())
ID1 ID2 ID3
Date
2022-01-03 1.0 1.0 2.0
2022-01-04 1.0 0.0 2.0
2022-01-05 2.0 2.0 1.0
2022-01-06 1.0 1.0 3.0
2022-01-07 2.0 2.0 1.0
2022-01-08 1.0 1.0 0.0
2022-01-09 2.0 1.0 2.0
2022-01-10 1.0 3.0 3.0
2022-01-11 2.0 2.0 1.0
2022-01-12 0.0 0.0 0.0
The numba version is fast but has an initial overhead to load the library and translate the code. So, use it if your dataset is very large, otherwise the "classic" iterative version will still be faster because it has no initial overhead.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
