'Efficient Merging of Python Pandas Multiindex Dataframes
I have two dataframes: The first is a multiindex frame that contains returns of two stocks and is denoted by:
import pandas as pd
from pandas import IndexSlice as idx
import numpy as np
dates = pd.date_range('2000-12-31', periods=6, freq='M', name='Date')
arrays = [dates.tolist()+dates.tolist(),["10000"]*6+["10001"]*6]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["Date", "Stock"])
df1 = pd.Series(np.random.randn(12), index=index).to_frame('Return').sort_index()
The second frame is denoted by:
and contains the code of the stocks for given time spans.
data = {'Stock':['10000','10000','10000','10001'],
'Start':['1990-12-31', '2001-03-05', '2001-05-19', '1991-03-31'],
'End':['2001-03-04', '2001-05-18', '2002-01-31', '2001-04-03'],
'Code':['10','11','10','10']}
df2 = pd.DataFrame(data)
df2 = df2.set_index('Stock').sort_index()
df2['Start'] = pd.to_datetime(df2['Start'])
df2['End'] = pd.to_datetime(df2['End'])
The start date of the respective span is given in the column 'Start', the end date in the column 'End'. I would like to add the stock codes (given in column 'Code' of df2) to df1 such that if the timeindex (of df1) of a stock is between the 'Start' and 'End' column in df2, then the corresponding entry in the column 'Code' of df1 contains the code given in df2. If no code is specified or no time span in df2 covers the timeindex in df1, then np.nan should be assigned.
Consider the example Date='2001-03-31' and Stock='10000'. According to the second row of df2, we see that Stock 10000 has from 2001-03-05 until 2001-05-18 the code 11. Therefore, the entry in the row ('2001-03-31','10000') and in the column 'Code' of df1 should contain 11.
The entire df1 is shown here:
The following function has the desired functionality but is (even with parallelization) very slow:
df1s = df1.swaplevel().sort_index().copy() # reorder s.t. date is first
df1['Code'] = np.nan
for p_tmp in df2.index.drop_duplicates().values:
d_tmp = df1s.loc[idx[p_tmp,:]].index.get_level_values(0) # End of each month
output = np.array([np.nan]*df1s.loc[idx[p_tmp,:]].index.size)
if isinstance(df2.loc[p_tmp], pd.Series):
b_mask = np.array((d_tmp >= pd.to_datetime(df2.loc[p_tmp]['Start'])) & (d_tmp <= pd.to_datetime(df2.loc[p_tmp]['End'])))
output[b_mask] = df2.loc[p_tmp]['Code']
else:
for index, row in df2.loc[p_tmp].iterrows():
b_mask = np.array((d_tmp >= pd.to_datetime(row['Start'])) & (d_tmp <= pd.to_datetime(row['End'])))
output[b_mask] = row['Code']
df1s.loc[p_tmp,'Code'] = output.copy()
df1 = df1s.swaplevel().sort_index().copy()
Can please someone give me hints how I can achieve a speed-up? :)
Solution 1:[1]
An solution could be like that:
df_tmp = pd.DataFrame(index=df1.index).join(df2, on='Stock', how='left')
blist_validlink = (df_tmp['Start']<=df_tmp.index.get_level_values('Date'))&(df_tmp['End']>=df_tmp.index.get_level_values('Date'))
df_tmp = df_tmp[blist_validlink]
df1 = df1.join(df_tmp['Code'], on=['Date','Stock'])
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | LisaBinder |



