'Best method for non-regular index-based interpolation on grouped dataframes
Problem statement
I had the following problem:
- I have samples that ran independent tests. In my dataframe, tests of sample with the same "test name" are also independent. So the couple
(test,sample)is independent and unique. - data are collected at non regular sampling rates, so we're speaking about unequaly spaced indices. This "time series" index is called
unreg_idxin the example. For the sake of simplicity, it is a float between 0 & 1. - I want to figure out what the value at a specific index, e.g. for
unreg_idx=0.5. If the value is missing, I just want a linear interpolation that depends on the index. If extrapolating because the value is at an extremum in the sortedunreg_idxof the group(test,sample), it can leaveNaN.
Note the following from pandas documentation:
Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.
’linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
The only solution I found is long, complex and slow. I am wondering if I am missing out on something, or on the contrary something is missing from the pandas library. I believe this is a typical issue in scientific and engineering fields to have independent tests on various samples with non regular indices.
What I tried
sample data set preparation
This part is just for making an example
import pandas as pd
import numpy as np
tests = (f'T{i}' for i in range(20))
samples = (chr(i) for i in range(97,120))
idx = pd.MultiIndex.from_product((tests,samples),names=('tests','samples'))
idx
dfs=list()
for ids in idx:
group_idx = pd.MultiIndex.from_product(((ids[0],),(ids[1],),tuple(np.random.random_sample(size=(90,))))).sort_values()
dfs.append(pd.DataFrame(1000*np.random.random_sample(size=(90,)),index=group_idx))
df = pd.concat(dfs)
df = df.rename_axis(index=('test','sample','nonreg_idx')).rename({0:'value'},axis=1)
The (bad) solution
add_missing = df.index.droplevel('nonreg_idx').unique().to_frame().reset_index(drop=True)
add_missing['nonreg_idx'] = .5
add_missing = pd.MultiIndex.from_frame(add_missing)
added_missing = df.reindex(add_missing)
df_full = pd.concat([added_missing.loc[~added_missing.index.isin(df.index)], df])
df_full.sort_index(inplace=True)
def interp_fnc(group):
try:
return group.reset_index(['test','sample']).interpolate(method='slinear').set_index(['test','sample'], append=True).reorder_levels(['test','sample','value']).sort_index()
except:
return group
grouped = df_full.groupby(level=['test','sample'])
df_filled = grouped.apply(
interp_fnc
)
Here, the wanted values are in df_filled. So I can do df_filled.loc[(slice(None), slice(None), .5),'value'] to get what I need for each sample/test.
I would have expected to be able to do the same within 1 or maximum 2 lines of code. I have 14 here. apply is quite a slow method. I can't even use numba.
Question
Can someone propose a better solution? If you think there is no better alternative, please comment and I'll open an issue...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
