Best method for non-regular index-based interpolation on grouped dataframes

Problem statement

I had the following problem:

  • I have samples that each ran independent tests. In my dataframe, tests that share the same test name but ran on different samples are also independent, so each (test, sample) pair is independent and unique.
  • Data are collected at non-regular sampling rates, so we are dealing with unequally spaced indices. This "time series" index is called nonreg_idx in the example. For the sake of simplicity, it is a float between 0 and 1.
  • I want to find the value at a specific index, e.g. nonreg_idx=0.5. If the value is missing, I want a linear interpolation that depends on the index values. If the target would require extrapolation, i.e. it lies outside the extrema of the sorted nonreg_idx of the group (test, sample), it can stay NaN.
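For a single group, the behaviour I am after is exactly what np.interp gives: linear interpolation against the actual index values, with NaN instead of extrapolation. A minimal sketch on made-up numbers (not the generated data below):

```python
import numpy as np

# One hypothetical group: values sampled at sorted, unequally spaced indices
x = np.array([0.1, 0.4, 0.9])    # nonreg_idx values of the group
y = np.array([10.0, 40.0, 90.0])  # measured values

# Linear in the index itself; left/right=np.nan leaves out-of-range
# targets as NaN instead of clamping to the edge values.
val = np.interp(0.5, x, y, left=np.nan, right=np.nan)   # blend of 40 and 90
out = np.interp(0.0, x, y, left=np.nan, right=np.nan)   # below the range -> NaN
```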

Note the following from pandas documentation:

Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.

’linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
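To see what that restriction costs, here is a toy single-level example (not from the data below) contrasting method='linear', which pretends the rows are equally spaced, with method='index', which respects the spacing but is unavailable on a MultiIndex:

```python
import numpy as np
import pandas as pd

# Unequally spaced float index with one missing value
s = pd.Series([0.0, np.nan, 10.0], index=[0.0, 0.1, 1.0])

# 'linear' ignores the index: the gap is filled halfway between neighbours
equal = s.interpolate(method='linear')

# 'index' uses the actual spacing: 0.1 is only 10% of the way to 1.0
by_index = s.interpolate(method='index')
```

The filled value is 5.0 with method='linear' but 1.0 with method='index'; it is the second behaviour I need per group.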

The only solution I found is long, complex and slow. I wonder whether I am missing something or whether, on the contrary, something is missing from the pandas library. I believe independent tests on various samples with non-regular indices are a typical situation in scientific and engineering fields.

What I tried

Sample data set preparation

This part just builds an example data set:

```python
import numpy as np
import pandas as pd

tests = [f'T{i}' for i in range(20)]
samples = [chr(i) for i in range(97, 120)]  # 'a' .. 'w'

idx = pd.MultiIndex.from_product((tests, samples), names=('tests', 'samples'))

# 90 rows per (test, sample), indexed by sorted random floats in [0, 1)
dfs = []
for ids in idx:
    group_idx = pd.MultiIndex.from_product(
        ((ids[0],), (ids[1],), tuple(np.random.random_sample(size=(90,))))
    ).sort_values()
    dfs.append(pd.DataFrame(1000 * np.random.random_sample(size=(90,)),
                            index=group_idx))

df = pd.concat(dfs)
df = df.rename_axis(index=('test', 'sample', 'nonreg_idx')).rename({0: 'value'}, axis=1)
```

The (bad) solution

```python
# Build a (test, sample, 0.5) row for every group
add_missing = df.index.droplevel('nonreg_idx').unique().to_frame().reset_index(drop=True)
add_missing['nonreg_idx'] = .5
add_missing = pd.MultiIndex.from_frame(add_missing)

# Append only the rows that are not already present
added_missing = df.reindex(add_missing)
df_full = pd.concat([added_missing.loc[~added_missing.index.isin(df.index)], df])
df_full.sort_index(inplace=True)

def interp_fnc(group):
    try:
        # 'slinear' interpolates against the remaining (nonreg_idx) index
        return (group
                .reset_index(['test', 'sample'])
                .interpolate(method='slinear')
                .set_index(['test', 'sample'], append=True)
                .reorder_levels(['test', 'sample', 'nonreg_idx'])
                .sort_index())
    except ValueError:
        # e.g. too few points to interpolate; leave the group untouched
        return group

grouped = df_full.groupby(level=['test', 'sample'])

df_filled = grouped.apply(interp_fnc)
```

Here, the wanted values are in df_filled, so df_filled.loc[(slice(None), slice(None), .5), 'value'] returns what I need for each (test, sample).
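For the final selection step, a toy frame (not the generated one) with the same three-level index layout shows that .loc call next to an equivalent .xs, which drops the selected level:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('T0', 'a', 0.2), ('T0', 'a', 0.5), ('T1', 'b', 0.5)],
    names=('test', 'sample', 'nonreg_idx'))
toy = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=idx).sort_index()

# Both pull the rows at nonreg_idx == 0.5 across every (test, sample)
a = toy.loc[(slice(None), slice(None), 0.5), 'value']  # keeps all 3 levels
b = toy.xs(0.5, level='nonreg_idx')['value']           # drops nonreg_idx
```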

I would have expected to do the same in one or at most two lines of code; here I need 14. apply is also quite a slow method, and I can't even use numba.
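For comparison, one shorter sketch I considered: since only the value at the single target index is needed, each group can be interpolated directly with np.interp inside one groupby.apply, skipping the reindex/concat dance entirely. The helper name at_index is mine, the level names match the example above, and I have not benchmarked whether this is actually faster:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df: two (test, sample) groups with sorted,
# unequally spaced nonreg_idx values
idx = pd.MultiIndex.from_tuples(
    [('T0', 'a', 0.2), ('T0', 'a', 0.8),
     ('T1', 'b', 0.1), ('T1', 'b', 0.4)],
    names=('test', 'sample', 'nonreg_idx'))
df = pd.DataFrame({'value': [20.0, 80.0, 10.0, 40.0]}, index=idx)

def at_index(group, target=0.5):
    # np.interp assumes x is sorted; left/right=np.nan avoids extrapolation
    x = group.index.get_level_values('nonreg_idx').to_numpy()
    return np.interp(target, x, group['value'].to_numpy(),
                     left=np.nan, right=np.nan)

result = df.groupby(level=['test', 'sample']).apply(at_index)
# result is a Series indexed by (test, sample); 'T1'/'b' has no point
# above 0.4, so its value stays NaN
```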

Question

Can someone propose a better solution? If you think there is no better alternative, please comment and I'll open an issue...



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
