Is there a way in Pandas to use the previous row's value in dataframe.apply when the previous value is also calculated in the apply?
I have the following dataframe:
Index_Date   A    B     C     D
================================
2015-01-31   10   10    NaN   10
2015-02-01   2    3     NaN   22
2015-02-02   10   60    NaN   280
2015-02-03   10   100   NaN   250
Required:
Index_Date   A    B     C      D
================================
2015-01-31   10   10    10     10
2015-02-01   2    3     23     22
2015-02-02   10   60    290    280
2015-02-03   10   100   3000   250
Column C is derived for 2015-01-31 by taking the value of D. For every subsequent row, C is the previous row's C multiplied by that row's A, plus that row's B; for 2015-02-01, for example, that is 10 * 2 + 3 = 23.
I have attempted an apply and a shift using an if/else, but this gives a key error.
Solution 1:[1]
Given a column of numbers:
import pandas as pd

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df
A
0 100
1 101
2 102
3 103
4 104
You can reference the previous row with shift:
df['Change'] = df.A - df.A.shift(1)
df
A Change
0 100 NaN
1 101 1.0
2 102 1.0
3 103 1.0
4 104 1.0
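The same shift idea carries over to the question's dataframe whenever the value being referenced already exists in the frame. A minimal sketch, assuming df holds the question's columns (the D_prev and D_ratio names are purely illustrative):

df['D_prev'] = df.D.shift(1)         # previous row's D; NaN on the first row
df['D_ratio'] = df.D / df['D_prev']  # combines row i with row i-1 without a loop

Note that shift only sees values already stored in the column, so on its own it cannot express the question's recursion, where each C depends on the C just computed; the remaining solutions address that case.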
Solution 2:[2]
numba
For recursive calculations which are not vectorisable, numba, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for loop and use the decorator @njit or (for older versions) @jit(nopython=True):
For a reasonable size dataframe, this gives a ~30x performance improvement versus a regular for loop:
from numba import jit
import numpy as np

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]  # seed the recursion with the first D value
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)
n = 10**5
df = pd.concat([df]*n, ignore_index=True)
# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T) # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T) # 444 ms per loop
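For reference, calculator() is described in the benchmark comments as the same function without the @jit decorator, so a sketch of it would simply be:

def calculator(a, b, d):
    # identical logic to calculator_nb, but interpreted rather than JIT-compiled
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res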
Solution 3:[3]
Applying the recursive function on numpy arrays will be faster than the current answer.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.repeat(np.arange(1, 6), 3).reshape(5, 3), columns=['A', 'B', 'D'])

new = [df.D.values[0]]  # seed with the first D value
for i in range(1, len(df.index)):
    new.append(new[i-1] * df.A.values[i] + df.B.values[i])
df['C'] = new
Output
A B D C
0 1 1 1 1
1 2 2 2 4
2 3 3 3 15
3 4 4 4 64
4 5 5 5 325
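The same loop applies directly to the question's data; a brief sketch, assuming df already holds the A, B and D columns from the question:

new = [df.D.values[0]]  # C for 2015-01-31 is just D
for i in range(1, len(df.index)):
    new.append(new[i-1] * df.A.values[i] + df.B.values[i])
df['C'] = new           # 10, 23, 290, 3000 for the sample data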
Solution 4:[4]
Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.
Disclaimer: I know this solution is not standard, but I think it works well.
import pandas as pd
import numpy as np
data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
A B C D
=================================
2015-01-31 10 10 NaN 10
2015-02-01 2 3 NaN 22
2015-02-02 10 60 NaN 280
2015-02-03 10 100 NaN 250
def calculate(mul, add):
    global value
    value = value * mul + add
    return value
value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
A B C D
=================================
2015-01-31 10 10 10 10
2015-02-01 2 3 23 22
2015-02-02 10 60 290 280
2015-02-03 10 100 3000 250
So basically we use pandas' apply together with a global variable that keeps track of the previously calculated value.
Time comparison with a for loop:
data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan
df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']
%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']
3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan
def calculate(mul, add):
    global value
    value = value * mul + add
    return value
value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So the apply version takes roughly 57% of the time of the explicit for loop, i.e. it is about 1.75x faster on average.
Solution 5:[5]
It's an old question but the solution below (without a for loop) might be helpful:
def new_fun(df):
    prev_value = df.iloc[0]["C"]

    def func2(row):
        # nonlocal variable ==> uses prev_value from the enclosing new_fun function
        nonlocal prev_value
        new_value = prev_value * row['A'] + row['B']
        prev_value = new_value
        return new_value

    # assign via .loc to avoid the SettingWithCopyWarning that chained
    # indexing (df.iloc[1:]["C"] = ...) would raise
    df.loc[df.index[1:], "C"] = df.iloc[1:].apply(func2, axis=1)
    return df

df = new_fun(df)
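A brief usage sketch on the question's sample data (assuming the first C value is seeded with D, as the question requires):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [10, 2, 10, 10],
                   'B': [10, 3, 60, 100],
                   'D': [10, 22, 280, 250]})
df['C'] = np.nan
df.loc[0, 'C'] = df.loc[0, 'D']   # seed the recursion
df = new_fun(df)                  # C becomes 10, 23, 290, 3000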
Solution 6:[6]
In general, the key to avoiding an explicit loop is to join (merge) two instances of the dataframe, offset by one row, so that row index i is matched with row index i-1.
You would then have one wide dataframe containing the columns of both row r and row r-1, over which you could run a single df.apply(), as sketched below.
However, the overhead of creating the large dataset may offset the benefits of parallel processing...
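A minimal sketch of that shifted self-join on the question's columns (the _prev suffix and the derived column name are illustrative):

import pandas as pd

df = pd.DataFrame({'A': [10, 2, 10, 10],
                   'B': [10, 3, 60, 100],
                   'D': [10, 22, 280, 250]})
prev = df.shift(1).add_suffix('_prev')   # row i-1 aligned with row i
wide = df.join(prev)                     # each row now holds both r and r-1
# a single row-wise apply can now see the current and previous rows together
wide['A_plus_prev_D'] = wide.apply(
    lambda r: r['A'] + (r['D_prev'] if pd.notna(r['D_prev']) else 0), axis=1)

As with shift in Solution 1, this only exposes values that already exist in the frame, so it does not by itself cover the recursive definition of C in the question.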
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | kztd |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | iipr |
| Solution 5 | |
| Solution 6 | feetwet |
