How do I implement a rank function for the nearest values of a column in a DataFrame?
df.head():
| | run_time | match_datetime | country | league | home_team | away_team |
|---|---|---|---|---|---|---|
| 0 | 2021-08-07 00:04:36.326391 | 2021-08-06 | Russia | FNL 2 - Group 2 | Yenisey 2 | Lokomotiv-Kazanka |
| 1 | 2021-08-07 00:04:36.326391 | 2021-08-07 | Russia | Youth League | Ural U19 | Krylya Sovetov Samara U19 |
| 2 | 2021-08-07 00:04:36.326391 | 2021-08-08 | World | Club Friendly | Alaves | Al Nasr |
| 3 | 2021-08-07 00:04:36.326391 | 2021-08-09 | China | Jia League | Chengdu Rongcheng | Shenyang Urban FC |
| 4 | 2021-08-06 00:04:36.326391 | 2021-08-06 | China | Super League | Wuhan FC | Tianjin Jinmen Tiger |
| 5 | 2021-08-06 00:04:36.326391 | 2021-08-07 | Czech Republic | U19 League | Sigma Olomouc U19 | Karvina U19 |
| 6 | 2021-08-06 00:04:36.326391 | 2021-08-08 | Russia | Youth League | Konoplev Academy U19 | Rubin Kazan U19 |
| 7 | 2021-08-06 00:04:36.326391 | 2021-08-09 | World | Club Friendly | Real Sociedad | Eibar |
Desired df:
| | run_time | match_datetime | country | league | home_team | away_team |
|---|---|---|---|---|---|---|
| 0 | 2021-08-07 00:04:36.326391 | 2021-08-06 | Russia | FNL 2 - Group 2 | Yenisey 2 | Lokomotiv-Kazanka |
| 1 | 2021-08-07 00:04:36.326391 | 2021-08-07 | Russia | Youth League | Ural U19 | Krylya Sovetov Samara U19 |
| 4 | 2021-08-06 00:04:36.326391 | 2021-08-06 | China | Super League | Wuhan FC | Tianjin Jinmen Tiger |
| 5 | 2021-08-06 00:04:36.326391 | 2021-08-07 | Czech Republic | U19 League | Sigma Olomouc U19 | Karvina U19 |
How do I use the rank function to keep only the 2 nearest match_datetime dates for every run_time value?
That is, the desired DataFrame is a filtered DataFrame containing the 2 nearest match_datetime values for every run_time.
Solution 1:[1]
Update
Using rank instead of head:
diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])).abs()
out = df.loc[diff.groupby(df['run_time']).rank(method='dense') <= 2]
Output:
>>> out
| | run_time | match_datetime | country | league | home_team | away_team |
|---|---|---|---|---|---|---|
| 1 | 2021-08-07 00:04:36.326391 | 2021-08-07 | Russia | Youth League | Ural U19 | Krylya Sovetov Samara U19 |
| 2 | 2021-08-07 00:04:36.326391 | 2021-08-08 | World | Club Friendly | Alaves | Al Nasr |
| 4 | 2021-08-06 00:04:36.326391 | 2021-08-06 | China | Super League | Wuhan FC | Tianjin Jinmen Tiger |
| 5 | 2021-08-06 00:04:36.326391 | 2021-08-07 | Czech Republic | U19 League | Sigma Olomouc U19 | Karvina U19 |
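To try the rank approach without the full dataset, here is a minimal sketch that rebuilds only the two date columns from the question (the other columns are left out as a simplification) and applies the same filter. It assumes both columns are stored as strings, as in the printed frame.

```python
import pandas as pd

# Reduced copy of the question's data: only the two columns the filter needs.
df = pd.DataFrame({
    "run_time": ["2021-08-07 00:04:36.326391"] * 4
                + ["2021-08-06 00:04:36.326391"] * 4,
    "match_datetime": ["2021-08-06", "2021-08-07", "2021-08-08", "2021-08-09"] * 2,
})

# Absolute gap between each run_time and its match_datetime.
diff = pd.to_datetime(df["run_time"]).sub(pd.to_datetime(df["match_datetime"])).abs()

# Rank the gaps within each run_time group; rank 1 is the closest match.
rank = diff.groupby(df["run_time"]).rank(method="dense")
out = df.loc[rank <= 2]

print(out.index.tolist())  # [1, 2, 4, 5]
```

Row 2 is kept instead of row 0 because run_time falls just after midnight: 2021-08-08 00:00 is about 23 h 55 min away, while 2021-08-06 00:00 is about 24 h 5 min away. This is why the output differs from the desired frame in the question, which lists row 0.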
Alternative
You can use:
diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])) \
.abs().sort_values()
out = df.loc[diff.groupby(df['run_time']).head(2).index].sort_index()
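The two variants can return different results when gaps are tied. The sketch below uses hypothetical toy data (not from the question) in which two matches are equally far from run_time: rank(method='dense') keeps every row whose gap is among the two smallest distinct values, while sort plus head(2) keeps exactly two rows per group.

```python
import pandas as pd

# Hypothetical toy data: one run_time, two matches exactly one day away.
df = pd.DataFrame({
    "run_time": ["2021-08-07"] * 4,
    "match_datetime": ["2021-08-07", "2021-08-06", "2021-08-08", "2021-08-10"],
})
diff = pd.to_datetime(df["run_time"]).sub(pd.to_datetime(df["match_datetime"])).abs()

# Dense rank: tied gaps share a rank, so both one-day rows pass the filter.
by_rank = df.loc[diff.groupby(df["run_time"]).rank(method="dense") <= 2]

# Sort + head(2): exactly two rows per group; a stable sort keeps the
# earlier row when gaps are tied.
nearest = diff.sort_values(kind="stable")
by_head = df.loc[nearest.groupby(df["run_time"]).head(2).index].sort_index()

print(by_rank.index.tolist())  # [0, 1, 2]
print(by_head.index.tolist())  # [0, 1]
```

Which behaviour is preferable depends on whether "2 nearest" means two rows or two distinct distances.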
Solution 2:[2]
I suspect that pandas.DataFrame.rank can't do this on its own. But pandas.DataFrame.groupby can, if you combine it with head.
Assuming you have the following pandas.DataFrame:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.array([np.random.randint(0, 3, 10), np.random.rand(10)]).transpose(), columns=['a', 'b'])
And that you want to keep max_num_per_example = 2 representatives of each unique value in the column df['a']:
max_num_per_example = 2
df.groupby(['a']).head(max_num_per_example)
yields
| | a | b |
|---|---|---|
| 0 | 2.0 | 0.058084 |
| 1 | 0.0 | 0.866176 |
| 2 | 2.0 | 0.601115 |
| 4 | 0.0 | 0.020584 |
| 7 | 1.0 | 0.212339 |
This gives the same rows as the naive approach:
max_idx_per_example = 2
idx_to_keep = []
for el_uq in df['a'].unique():
    lg = el_uq == df['a']  # boolean mask of the rows holding this value
    for i, idx in enumerate(lg[lg].index):
        if i < max_idx_per_example:
            idx_to_keep.append(idx)
        else:
            break
# idx_to_keep holds index labels, so select with .loc and restore row order
df_new = df.loc[idx_to_keep].sort_index()
Which underlines the power of pandas =)
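As a cross-check on what head does per group, the selection can also be written with a per-group row counter. This is a sketch under one assumption: the frame is built from a dict here, so column a holds integers rather than the floats produced by the transposed array above.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({"a": np.random.randint(0, 3, 10), "b": np.random.rand(10)})

max_num_per_example = 2

# First n rows of each group, returned in the original row order.
via_head = df.groupby("a").head(max_num_per_example)

# cumcount() numbers the rows inside each group starting at 0,
# so "counter < n" marks the same first n rows per group.
via_cumcount = df[df.groupby("a").cumcount() < max_num_per_example]

print(via_head.equals(via_cumcount))  # True
```

Note that both forms take the first rows in the existing order. To get the nearest values, as the question asks, the frame has to be sorted by distance first.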
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Corralien |
| Solution 2 | max |
