'Match IntervalIndex as part of a MultiIndex

I have a problem setup using a pandas IntervalIndex, similar to those posed here and here, but with differences that have thus far prevented me from successfully implementing their solutions.

I have two DataFrames, constructed as such:

df1

    month   group   time    distance
0   1       A       10      100
1   1       A       20      120
2   1       A       25      110
3   1       B       5       90
4   1       B       7       99
5   1       B       17      60
6   1       C       30      200
7   1       C       35      170
8   2       A       5       40
9   2       A       10      27
10  2       B       17      33
11  2       B       19      65

df1.set_index(['month', 'group', 'time'], inplace=True)

df2

    month   group   start   end     period
0   1       A       5       15      1
1   1       A       15      21      2
2   1       A       21      30      3
3   1       B       2       10      1
4   1       B       10      20      2
5   2       A       3       8       1
6   2       A       8       15      2
7   2       B       10      20      1

df2.set_index(['month', 'group'], inplace=True)

Where start and end in df2 can be used to define an interval. My real data are much larger, with df1 on the order of 100,000 rows and df2 on the order of 10,000 rows.

What I would like to do is assign the period variable to df1, matching on month, group, and if df1['time'] is in the interval in df2.

There are two complications over the problems posed in the questions referenced above:

  1. The IntervalIndex needs to be just one part of a MultiIndex, and the matching needs to happen on all three levels of the index.

  2. Sometimes a time exists in df1 that doesn't match an interval in df2. I don't have a strong preference for if those rows are populated with NaN or another missing value, or if they are just removed altogether, but what I can't have is a KeyError.

My first steps were:

df2Index = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='left')
df2.set_index(df2Index, append=True, inplace=True)

to apply the IntervalIndex to df2 while keeping month and group as higher level indices.

I tried a few approaches, such as

period = df2.loc[df2.index.get_indexer(df1.index), 'period']
period = df2.get_loc(df1.index), 'period']

But I have been unable to get the correct indexing behavior. For completeness, my desired result is:

    month   group   time    distance    period
0   1       A       10      100         1
1   1       A       20      120         2
2   1       A       25      110         3
3   1       B       5       90          1
4   1       B       7       99          1
5   1       B       17      60          2
6   1       C       30      200         NaN
7   1       C       35      170         NaN
8   2       A       5       40          1
9   2       A       10      27          2
10  2       B       17      33          1
11  2       B       19      65          1

(or the same table but without the two rows where period is NaN).

My backup plan is to loop over the relevant combinations of month and group, subset the two tables appropriately, use the IntervalIndex on these subset tables, then recombine them. But that seems like the wrong solution.



Solution 1:[1]

I would merge the two frames together using a left join, then filter to show only those rows where time is within the desired start and end period:

import pandas as pd

# setup the frame
df1 = pd.DataFrame(
    data={
        'month': [1,1,1,1,1,1,1,1,2,2,2,2],
        'group': ['A','A','A','B','B','B','C','C','A','A','B','B'],
        'time': [10,20,25,5,7,17,30,35,5,10,17,19],
        'distance': [100,120,110,90,99,60,200,170,40,27,33,65],
    })

df2 = pd.DataFrame(
    data={
        'month': [1,1,1,1,1,2,2,2],
        'group': ['A','A','A','B','B','A','A','B'],
        'start': [5,15,21,2,10,3,8,10],
        'end': [15,21,30,10,20,8,15,20],
        'period': [1,2,3,1,2,1,2,1],
    })

# merge, and filter
df = df1.merge(df2, how='left', on=['month','group'])
df = df[(df.time >= df.start) & (df.time <= df.end)][['month','group','time','distance','period']].reset_index(drop=True)

print df

    month   group   time    distance    period
0   1       A       10      100         1.0
1   1       A       20      120         2.0
2   1       A       25      110         3.0
3   1       B       5       90          1.0
4   1       B       7       99          1.0
5   1       B       17      60          2.0
6   2       A       5       40          1.0
7   2       A       10      27          2.0
8   2       B       17      33          1.0
9   2       B       19      65          1.0

Note that the above frame excludes NaNs. If you want them, update the filter criteria: df[((df.time >= df.start) & (df.time <= df.end)) | (df.period.isnull())]

Solution 2:[2]

I came across this recently and since I don't haven't come across a solution using the MultiIndex, I think you might need to operate on grouped data and concatenate afterwards.

Using your df1 and df2:

df1_grouped = df1.groupby(['month','group'])
results = []
for gid, grp in df1_grouped:
    # Filter df2 for the group of df1
    df2_grouped = df2[(df2['month']==gid[0]) & (df2['group']==gid[1])]
    
    # Create the group interval index for the lookup table
    int_idx = pd.IntervalIndex.from_arrays(df2_grouped['start'], df2_grouped['end'], 'left')
    
    if len(int_idx) > 0:
        # Use the interval index and slice the lookup table based on the group of df1
        results.append(grp.assign(period=df2_grouped.set_index(int_idx).loc[grp['time']]['period'].values))
    else:
        results.append(grp)
pd.concat(results).reset_index(drop=True)

Output:

    month   group   time    distance    period
0   1   A   10  100 1.0
1   1   A   20  120 2.0
2   1   A   25  110 3.0
3   1   B   5   90  1.0
4   1   B   7   99  1.0
5   1   B   17  60  2.0
6   1   C   30  200 NaN
7   1   C   35  170 NaN
8   2   A   5   40  1.0
9   2   A   10  27  2.0
10  2   B   17  33  1.0
11  2   B   19  65  1.0

Solution 3:[3]

I'm trying to do this just now. My approach is to create a concatenated index column using

df.ID.str + df.date.astype(str).replace('-','').

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 NickBraunagel
Solution 2 UnsolicitedCrap
Solution 3 Anshul Goyal