'pandas get a subset of rows within a size range
I have a pandas dataframe:
df = pd.DataFrame({'start': [50, 100, 50000, 50030, 100000],
'end': [51, 101, 50001, 50031, 100001],
'value': [1, 2, 3, 4, 5]},
index=['id1', 'id2', 'id3', 'id4', 'id5'])
>>> df
start end value
id1 50 51 1
id2 100 101 2
id3 50000 50001 3
id4 50030 50031 4
id5 100000 100001 5
Now I would like to extract groups of all rows within the size range of 150 in column "start". The output should look like:
group group_start group_end min_val max_value id_count
1 50 101 1 2 2
2 50000 50031 3 4 2
3 100000 100001 5 5 1
How to extract those groups?
Solution 1:[1]
Use:
start = df['start'].iloc[0]
g = 0
gs = []
for val in df['start']:
if val-start<150:
gs.append(g)
else:
g+=1
start = val
gs.append(g)
df['g'] = gs
df.groupby('g').agg(group_start = ('start', 'first'), group_end = ('end', 'last'), min_val = ('value', 'min'), max_value = ('value', 'max'), id_count = ('value', 'count'))
Output:
Based on the comment:
df.groupby('g').agg(group_start = ('start', 'first'), group_end = ('end', 'last'), min_val = ('value', 'min'), max_value = ('value', 'max'), id_count = ('value', 'idxmax'))
Solution 2:[2]
For range 150 is possible use integer division by 150 with factorize for groups starting by 1 and then aggregate columns by named aggregations, last add column group:
a = pd.factorize(df['start'] // 150)[0] + 1
df = (df.groupby(a).agg(group_start = ('start','first'),
group_end = ('end','last'),
min_val = ('value','min'),
max_val = ('value','max'),
id_count=('value','size'))
.rename_axis('group')
.reset_index())
print (df)
group group_start group_end min_val max_val id_count
0 1 50 101 1 2 2
1 2 50000 50031 3 4 2
2 3 100000 100001 5 5 1
If 150 is count from values by subtracted first value of start use:
a = pd.factorize(df['start'].sub(df['start'].iat[0]) // 150)[0] + 1
df = (df.groupby(a).agg(group_start = ('start','first'),
group_end = ('end','last'),
min_val = ('value','min'),
max_val = ('value','max'),
id_count=('value','size'))
.rename_axis('group')
.reset_index())
EDIT: For indices by maximal values by value column use DataFrameGroupBy.idxmax:
a = pd.factorize(df['start'] // 150)[0] + 1
df = (df.groupby(a).agg(group_start = ('start','first'),
group_end = ('end','last'),
min_val = ('value','min'),
max_val = ('value','max'),
id_count=('value','size'),
indicec_by_max_val=('value','idxmax'))
.rename_axis('group')
.reset_index())
print (df)
group group_start group_end min_val max_val id_count \
0 1 50 101 1 2 2
1 2 50000 50031 3 4 2
2 3 100000 100001 5 5 1
indicec_by_max_val
0 id2
1 id4
2 id5
Solution 3:[3]
This will use your actual values in df["start"] for binning - borders and not use a fixed binning into 0...149,150..299, like jezrael's answer does.
It follows a quite manual (hence probably suboptimal) approach ... not sure how it works for bigger datasets - but here it goes:
import pandas as pd
df = pd.DataFrame({'start': [50, 100, 50000, 50030, 100000],
'end': [51, 101, 50001, 50031, 100001],
'value': [1, 2, 3, 4, 5]},
index=['id1', 'id2', 'id3', 'id4', 'id5'])
gcol = "group,group_start,group_end,min_val,max_value,id_count".split(",")
grpd = pd.DataFrame(columns=gcol)
last_end = None # to skip already included values
groupnr = 1 # simple numbering
# df should be sorted for "start" like your demo data is
for value in df.start:
# skip row if already included in last range
if value < (last_end or value):
print("Skipped", value)
continue
# get all values in range
m = df[ df["start"].isin(range(-150+value,value+151))]
# get stuff for grouped df
mi = m.start.min()
ma = m.end.max()
last_end = ma
mi_i = m[m.start == mi].value[0]
ma_i = m[m.end == ma].value[0]
l = len(m)
# add to grp'ed df
grpd = grpd.append(dict(zip(gcol, [groupnr, mi,ma,mi_i,ma_i,l])),
ignore_index=True)
# and one more
groupnr += 1
# output shenanigans
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 10000)
print(grpd)
Output:
Skipped 100 # was already used up in 50-200
Skipped 50030 # was already used up in 50000-50150
group group_start group_end min_val max_value id_count
0 1 50 101 1 2 2
1 2 50000 50031 3 4 2
2 3 100000 100001 5 5 1
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 |

