Interpolate groupwise - how to improve performance

Please consider the following df:

import pandas as pd

data = {
    'year': [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011],
    'bfsId': [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2],
    'income': [15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000],
    'taxes_perc': [0.74, 1.715, 3.192, 4.09, 0.813333333, 1.905, 3.548, 4.543333333, 0.753333333, 1.775, 3.308, 4.183333333, 0.813333333, 1.94, 3.608, 4.563333333],
    'perc_inc': [17375, 23625, 33875, 33875, 17375, 23625, 33875, 33875, 17375, 23500, 33625, 33625, 17375, 23500, 33625, 33625],
}

df = pd.DataFrame(data)

I want to apply scipy.interpolate.interp1d separately for each combination of year and bfsId: within each group, income and taxes_perc define the interpolant, which is then evaluated at perc_inc. I came up with a loop that does what I intend, but it is slow. My real data covers more than 20 years and more than 2000 bfsIds, and I have about 20 such datasets.

This is my loop:

import scipy.interpolate

frames = []  # per-group results; DataFrame.append was removed in pandas 2.0

for j in range(2010, 2012):
    df_jahr = df[df.year == j]  # rows for one year
    for i in df_jahr.bfsId.unique():
        df_jahr_gem = df_jahr[df_jahr.bfsId == i].copy()  # rows for one bfsId in that year
        y = df_jahr_gem.taxes_perc
        x = df_jahr_gem.income
        # linear interpolant through (income, taxes_perc), extrapolating outside that range
        y_interp = scipy.interpolate.interp1d(x, y, fill_value="extrapolate")
        df_jahr_gem['tax_rate_interpol'] = y_interp(df_jahr_gem.perc_inc)
        frames.append(df_jahr_gem)

df_interpol = pd.concat(frames)

Any ideas on how to rewrite the code? Perhaps with groupby and a function, but I was not able to implement that myself.
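
For reference, here is a minimal sketch of what a groupby-based version might look like. It has not been benchmarked against the real data, so any speed-up is an assumption; the idea is only that groupby splits the frame once instead of filtering it with a boolean mask for every year and bfsId. The names interpolate_group and pieces are made up for this illustration.

```python
import pandas as pd
from scipy.interpolate import interp1d

data = {
    'year': [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011],
    'bfsId': [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2],
    'income': [15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000, 15000, 20000, 25000, 30000],
    'taxes_perc': [0.74, 1.715, 3.192, 4.09, 0.813333333, 1.905, 3.548, 4.543333333, 0.753333333, 1.775, 3.308, 4.183333333, 0.813333333, 1.94, 3.608, 4.563333333],
    'perc_inc': [17375, 23625, 33875, 33875, 17375, 23625, 33875, 33875, 17375, 23500, 33625, 33625, 17375, 23500, 33625, 33625],
}
df = pd.DataFrame(data)


def interpolate_group(group):
    """Hypothetical helper: interpolate taxes_perc over income, evaluated at perc_inc."""
    f = interp1d(group['income'], group['taxes_perc'], fill_value="extrapolate")
    # Keep the original row labels so the result can be aligned back onto df.
    return pd.Series(f(group['perc_inc'].to_numpy()), index=group.index)


# One pass over the (year, bfsId) groups; each piece keeps its original index.
pieces = [interpolate_group(group) for _, group in df.groupby(['year', 'bfsId'])]

# Column assignment aligns on the index, so the row order of df is unchanged.
df['tax_rate_interpol'] = pd.concat(pieces)
```

Because each piece keeps the row labels of its group, the new column lands on the right rows even if the groups are not contiguous in df. With thousands of groups, the per-group interp1d construction may still dominate the runtime, so this is a starting point rather than a guaranteed fix.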



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
