'Match IntervalIndex as part of a MultiIndex
I have a problem setup using a pandas IntervalIndex, similar to those posed here and here, but with differences that have thus far prevented me from successfully implementing their solutions.
I have two DataFrames, constructed as such:
df1
month group time distance
0 1 A 10 100
1 1 A 20 120
2 1 A 25 110
3 1 B 5 90
4 1 B 7 99
5 1 B 17 60
6 1 C 30 200
7 1 C 35 170
8 2 A 5 40
9 2 A 10 27
10 2 B 17 33
11 2 B 19 65
df1.set_index(['month', 'group', 'time'], inplace=True)
df2
month group start end period
0 1 A 5 15 1
1 1 A 15 21 2
2 1 A 21 30 3
3 1 B 2 10 1
4 1 B 10 20 2
5 2 A 3 8 1
6 2 A 8 15 2
7 2 B 10 20 1
df2.set_index(['month', 'group'], inplace=True)
Where start and end in df2 can be used to define an interval. My real data are much larger, with df1 on the order of 100,000 rows and df2 on the order of 10,000 rows.
What I would like to do is assign the period variable to df1, matching on month, group, and if df1['time'] is in the interval in df2.
There are two complications over the problems posed in the questions referenced above:
The
IntervalIndexneeds to be just one part of aMultiIndex, and the matching needs to happen on all three levels of the index.Sometimes a time exists in
df1that doesn't match an interval indf2. I don't have a strong preference for if those rows are populated withNaNor another missing value, or if they are just removed altogether, but what I can't have is aKeyError.
My first steps were:
df2Index = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='left')
df2.set_index(df2Index, append=True, inplace=True)
to apply the IntervalIndex to df2 while keeping month and group as higher level indices.
I tried a few approaches, such as
period = df2.loc[df2.index.get_indexer(df1.index), 'period']
period = df2.get_loc(df1.index), 'period']
But I have been unable to get the correct indexing behavior. For completeness, my desired result is:
month group time distance period
0 1 A 10 100 1
1 1 A 20 120 2
2 1 A 25 110 3
3 1 B 5 90 1
4 1 B 7 99 1
5 1 B 17 60 2
6 1 C 30 200 NaN
7 1 C 35 170 NaN
8 2 A 5 40 1
9 2 A 10 27 2
10 2 B 17 33 1
11 2 B 19 65 1
(or the same table but without the two rows where period is NaN).
My backup plan is to loop over the relevant combinations of month and group, subset the two tables appropriately, use the IntervalIndex on these subset tables, then recombine them. But that seems like the wrong solution.
Solution 1:[1]
I would merge the two frames together using a left join, then filter to show only those rows where time is within the desired start and end period:
import pandas as pd
# setup the frame
df1 = pd.DataFrame(
data={
'month': [1,1,1,1,1,1,1,1,2,2,2,2],
'group': ['A','A','A','B','B','B','C','C','A','A','B','B'],
'time': [10,20,25,5,7,17,30,35,5,10,17,19],
'distance': [100,120,110,90,99,60,200,170,40,27,33,65],
})
df2 = pd.DataFrame(
data={
'month': [1,1,1,1,1,2,2,2],
'group': ['A','A','A','B','B','A','A','B'],
'start': [5,15,21,2,10,3,8,10],
'end': [15,21,30,10,20,8,15,20],
'period': [1,2,3,1,2,1,2,1],
})
# merge, and filter
df = df1.merge(df2, how='left', on=['month','group'])
df = df[(df.time >= df.start) & (df.time <= df.end)][['month','group','time','distance','period']].reset_index(drop=True)
print df
month group time distance period
0 1 A 10 100 1.0
1 1 A 20 120 2.0
2 1 A 25 110 3.0
3 1 B 5 90 1.0
4 1 B 7 99 1.0
5 1 B 17 60 2.0
6 2 A 5 40 1.0
7 2 A 10 27 2.0
8 2 B 17 33 1.0
9 2 B 19 65 1.0
Note that the above frame excludes NaNs. If you want them, update the filter criteria: df[((df.time >= df.start) & (df.time <= df.end)) | (df.period.isnull())]
Solution 2:[2]
I came across this recently and since I don't haven't come across a solution using the MultiIndex, I think you might need to operate on grouped data and concatenate afterwards.
Using your df1 and df2:
df1_grouped = df1.groupby(['month','group'])
results = []
for gid, grp in df1_grouped:
# Filter df2 for the group of df1
df2_grouped = df2[(df2['month']==gid[0]) & (df2['group']==gid[1])]
# Create the group interval index for the lookup table
int_idx = pd.IntervalIndex.from_arrays(df2_grouped['start'], df2_grouped['end'], 'left')
if len(int_idx) > 0:
# Use the interval index and slice the lookup table based on the group of df1
results.append(grp.assign(period=df2_grouped.set_index(int_idx).loc[grp['time']]['period'].values))
else:
results.append(grp)
pd.concat(results).reset_index(drop=True)
Output:
month group time distance period
0 1 A 10 100 1.0
1 1 A 20 120 2.0
2 1 A 25 110 3.0
3 1 B 5 90 1.0
4 1 B 7 99 1.0
5 1 B 17 60 2.0
6 1 C 30 200 NaN
7 1 C 35 170 NaN
8 2 A 5 40 1.0
9 2 A 10 27 2.0
10 2 B 17 33 1.0
11 2 B 19 65 1.0
Solution 3:[3]
I'm trying to do this just now. My approach is to create a concatenated index column using
df.ID.str + df.date.astype(str).replace('-','').
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | NickBraunagel |
| Solution 2 | UnsolicitedCrap |
| Solution 3 | Anshul Goyal |
