pandas: get a subset of rows within a size range

I have a pandas dataframe:

df = pd.DataFrame({'start': [50, 100, 50000, 50030, 100000],
                'end': [51, 101, 50001, 50031, 100001],
                'value': [1, 2, 3, 4, 5]},
               index=['id1', 'id2', 'id3', 'id4', 'id5'])

>>> df
      start     end  value
id1      50      51      1
id2     100     101      2
id3   50000   50001      3
id4   50030   50031      4
id5  100000  100001      5

Now I would like to extract groups of rows whose values in column "start" lie within a range of 150 of each other. The output should look like:

group    group_start   group_end       min_val      max_value   id_count
  1         50           101             1              2         2
  2         50000        50031           3              4         2
  3         100000       100001          5              5         1

How to extract those groups?



Solution 1:[1]

Use:

start = df['start'].iloc[0]
g = 0
gs = []
for val in df['start']:
    if val - start < 150:
        gs.append(g)
    else:
        g += 1
        start = val
        gs.append(g)

df['g'] = gs
df.groupby('g').agg(group_start=('start', 'first'),
                    group_end=('end', 'last'),
                    min_val=('value', 'min'),
                    max_value=('value', 'max'),
                    id_count=('value', 'count'))

Output:

   group_start  group_end  min_val  max_value  id_count
g
0           50        101        1          2         2
1        50000      50031        3          4         2
2       100000     100001        5          5         1

Based on the comment asking for the index label of the row with the maximal value per group, add an idxmax aggregation (note: it should get its own output column rather than replacing id_count):

df.groupby('g').agg(group_start=('start', 'first'),
                    group_end=('end', 'last'),
                    min_val=('value', 'min'),
                    max_value=('value', 'max'),
                    id_count=('value', 'count'),
                    id_by_max_val=('value', 'idxmax'))
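The loop above can also be approximated in vectorized form (my own sketch, not part of the original answer) by starting a new group wherever the gap to the previous row is at least 150:

```python
import pandas as pd

df = pd.DataFrame({'start': [50, 100, 50000, 50030, 100000],
                   'end': [51, 101, 50001, 50031, 100001],
                   'value': [1, 2, 3, 4, 5]},
                  index=['id1', 'id2', 'id3', 'id4', 'id5'])

# Start a new group whenever the gap to the *previous* row is >= 150.
g = df['start'].diff().ge(150).cumsum()
out = df.groupby(g).agg(group_start=('start', 'first'),
                        group_end=('end', 'last'),
                        min_val=('value', 'min'),
                        max_value=('value', 'max'),
                        id_count=('value', 'count'))
```

Note this is not exactly equivalent to the loop: the loop measures each value against the first value of the current group, while diff compares consecutive rows, so chained data (e.g. starts 0, 100, 200) would form one group here but two in the loop.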

Solution 2:[2]

For a fixed range of 150 it is possible to use integer division by 150 with factorize for group numbers starting at 1, then aggregate the columns with named aggregations and last add the column group:

a = pd.factorize(df['start'] // 150)[0] + 1
df = (df.groupby(a).agg(group_start=('start', 'first'),
                        group_end=('end', 'last'),
                        min_val=('value', 'min'),
                        max_val=('value', 'max'),
                        id_count=('value', 'size'))
        .rename_axis('group')
        .reset_index())

print (df)
   group  group_start  group_end  min_val  max_val  id_count
0      1           50        101        1        2         2
1      2        50000      50031        3        4         2
2      3       100000     100001        5        5         1
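To see how the grouping expression works, here are its intermediate values on the sample data (a minimal illustration, separate from the answer's code):

```python
import pandas as pd

s = pd.Series([50, 100, 50000, 50030, 100000])
bins = s // 150                     # fixed 150-wide bins: 0, 0, 333, 333, 666
codes = pd.factorize(bins)[0] + 1   # renumbered consecutively: 1, 1, 2, 2, 3
```

factorize turns the arbitrary bin numbers into consecutive codes, so the group column counts 1, 2, 3 instead of jumping to 333 and 666.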

If the 150 should be counted from the values after subtracting the first value of start, use:

a = pd.factorize(df['start'].sub(df['start'].iat[0]) // 150)[0] + 1
df = (df.groupby(a).agg(group_start=('start', 'first'),
                        group_end=('end', 'last'),
                        min_val=('value', 'min'),
                        max_val=('value', 'max'),
                        id_count=('value', 'size'))
        .rename_axis('group')
        .reset_index())

EDIT: For the index labels of the maximal values in the value column, use DataFrameGroupBy.idxmax:

a = pd.factorize(df['start'] // 150)[0] + 1
df = (df.groupby(a).agg(group_start=('start', 'first'),
                        group_end=('end', 'last'),
                        min_val=('value', 'min'),
                        max_val=('value', 'max'),
                        id_count=('value', 'size'),
                        indices_by_max_val=('value', 'idxmax'))
        .rename_axis('group')
        .reset_index())

print (df)
   group  group_start  group_end  min_val  max_val  id_count  \
0      1           50        101        1        2         2
1      2        50000      50031        3        4         2
2      3       100000     100001        5        5         1

  indices_by_max_val
0                id2
1                id4
2                id5
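idxmax returns the index *label* of the maximum, not its integer position, which is why the column above contains id2/id4/id5. A minimal sketch:

```python
import pandas as pd

s = pd.Series([3, 4], index=['id3', 'id4'])
# idxmax returns the index label of the maximum value
label = s.idxmax()
```

This is what makes it useful here: the original string ids survive the groupby aggregation.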

Solution 3:[3]

This will use the actual values in df["start"] as binning borders, instead of a fixed binning into 0...149, 150...299, ... like jezrael's answer does.

It follows a fairly manual (hence probably suboptimal) approach; not sure how it scales to bigger datasets, but here it goes:

import pandas as pd 

df = pd.DataFrame({'start': [50, 100, 50000, 50030, 100000],
                'end': [51, 101, 50001, 50031, 100001],
                'value': [1, 2, 3, 4, 5]},
            index=['id1', 'id2', 'id3', 'id4', 'id5'])

gcol = "group,group_start,group_end,min_val,max_value,id_count".split(",")
rows = []

last_end = None   # to skip already included values
groupnr = 1       # simple numbering

# df should be sorted by "start" like your demo data is
for value in df.start:
    # skip row if already included in the last range
    if value < (last_end or value):
        print("Skipped", value)
        continue

    # get all rows whose start lies within 150 of this value
    m = df[df["start"].isin(range(value - 150, value + 151))]

    # get stuff for the grouped df
    mi = m.start.min()
    ma = m.end.max()
    last_end = ma
    mi_i = m.loc[m.start == mi, "value"].iloc[0]
    ma_i = m.loc[m.end == ma, "value"].iloc[0]
    l = len(m)

    # collect the group row (DataFrame.append was removed in pandas 2.0,
    # so build the frame from a list of dicts instead)
    rows.append(dict(zip(gcol, [groupnr, mi, ma, mi_i, ma_i, l])))
    # and one more
    groupnr += 1

grpd = pd.DataFrame(rows, columns=gcol)

# output shenanigans
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 10000)
print(grpd)

Output:

Skipped 100      # was already used up in 50-200
Skipped 50030    # was already used up in 50000-50150

      group group_start group_end min_val max_value id_count
0     1     50          101       1       2         2
1     2     50000       50031     3       4         2
2     3     100000      100001    5       5         1
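Because df is sorted by "start", the same greedy grouping can be sketched with numpy's searchsorted instead of scanning with isin (my own assumption-laden variant, not from the answers; it measures the window forward from each group's first start, which matches the ±150 scan on sorted data since nothing before the group start remains unconsumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [50, 100, 50000, 50030, 100000],
                   'end': [51, 101, 50001, 50031, 100001],
                   'value': [1, 2, 3, 4, 5]},
                  index=['id1', 'id2', 'id3', 'id4', 'id5'])

starts = df['start'].to_numpy()
labels = np.empty(len(starts), dtype=int)
i, g = 0, 0
while i < len(starts):
    g += 1
    # first row whose start falls outside [starts[i], starts[i] + 150],
    # found by binary search in O(log n)
    j = np.searchsorted(starts, starts[i] + 150, side='right')
    labels[i:j] = g
    i = j
```

The labels can then feed the same named aggregation as in the other answers, e.g. `df.groupby(labels).agg(...)`.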

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3