'How to get distinct rows from pandas dataframe?

I am having trouble with getting distinct values from my dataframe.. Below is the code i currently use, in line 25(3rd of vier()) is the issue: I would like to show the top 10 fastest drivers based on their average heat(go-kart heat) time.

Input:

HeatNumber,NumberOfKarts,KartNumber,DriverName,Laptime
334,11,5,Monique,00:53.862
334,11,5,Monique,00:59.070
334,11,5,Monique,00:47.832
334,11,5,Monique,00:47.213
334,11,5,Monique,00:51.975
334,11,5,Monique,00:46.423
334,11,5,Monique,00:49.539
334,11,5,Monique,00:49.935
334,11,5,Monique,00:45.267
334,11,12,Robert-Jan,00:55.606
334,11,12,Robert-Jan,00:52.249
334,11,12,Robert-Jan,00:50.965
334,11,12,Robert-Jan,00:53.878
334,11,12,Robert-Jan,00:48.802
334,11,12,Robert-Jan,00:48.766
334,11,12,Robert-Jan,00:46.003
334,11,12,Robert-Jan,00:46.257
334,11,12,Robert-Jan,00:47.334
334,11,20,Katja,00:56.222
334,11,20,Katja,01:01.005
334,11,20,Katja,00:50.296
334,11,20,Katja,00:48.004
334,11,20,Katja,00:51.203
334,11,20,Katja,00:47.672
334,11,20,Katja,00:50.243
334,11,20,Katja,00:50.453
334,11,20,Katja,01:06.192
334,11,13,Bensu,00:56.332
334,11,13,Bensu,00:54.550
334,11,13,Bensu,00:52.023
334,11,13,Bensu,00:52.518
334,11,13,Bensu,00:50.738
334,11,13,Bensu,00:50.359
334,11,13,Bensu,00:49.307
334,11,13,Bensu,00:49.595
334,11,13,Bensu,00:50.504
334,11,17,Marit,00:56.740
334,11,17,Marit,00:52.534
334,11,17,Marit,00:48.331
334,11,17,Marit,00:56.204
334,11,17,Marit,00:49.066
334,11,17,Marit,00:49.210
334,11,17,Marit,00:45.655
334,11,17,Marit,00:46.261
334,11,17,Marit,00:46.837
334,11,11,Niels,00:58.518
334,11,11,Niels,01:01.562
334,11,11,Niels,00:51.238
334,11,11,Niels,00:48.808

Code:

import pandas as pd
import matplotlib.pyplot as plt

#Data
df = pd.read_csv('dataset_kartanalyser.csv')
df = df.dropna(axis=0, how='any')
df = df.join(df['Laptime'].str.split(':', 1, expand=True).rename(columns={0:'M', 1:'S'}))
df['M'] = df['M'].astype(int)
df['S'] = df['S'].astype(float)
df['Laptime'] = (df['M'] * 60) + df['S']
df.drop(['M', 'S'], axis=1, inplace=True)

#Funties
def twee():
    print("Het totaal aantal karts = " + str(df['KartNumber'].nunique())) 
    print("Het aantal unique drivers = " + str(df['DriverName'].nunique()))
    print("Het totaal aantal heats = " + str(df['HeatNumber'].nunique())) 

def drie():
    print("De 10 snelste Drivers obv individuele tijd zijn: ")
    print((df.groupby('DriverName')['Laptime'].nsmallest(1)).nsmallest(10))

def vier():
    print('De 10 snelste Drivers obv snelste heat gemiddelde:')
    print((df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)).nsmallest(10))

print(df)
     HeatNumber  NumberOfKarts KartNumber DriverName  Laptime
0           334             11          5    Monique   53.862
1           334             11          5    Monique   59.070
2           334             11          5    Monique   47.832
3           334             11          5    Monique   47.213
4           334             11          5    Monique   51.975
...         ...            ...        ...        ...      ...
4053        437              2         20       luuk   39.678
4054        437              2         20       luuk   39.872
4055        437              2         20       luuk   39.454
4056        437              2         20       luuk   39.575
4057        437              2         20       luuk   39.648

Output:

DriverName   HeatNumber
giovanni     411           26.233
ryan         411           27.747
giovanni     408           27.938
papa         394           28.075
guus         406           28.998
Rob          427           29.371
Suus         427           29.416
Jan-jullius  394           29.428
Joep         427           29.934
Indy         423           29.991

The output i get i almost correct, expect that the driver "giovanni" occurs twice. I would like to only show the fastest avg heat time for each driver. Anyone who know how to do this?



Solution 1:[1]

ok so add drop_duplication on a column like this just need to add sort as well df.sort_values('B', ascending=True)
.drop_duplicates('A', keep='first')

(df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)sort_values('Laptime', ascending=True).drop_duplicates('DriverName', keep='first')).nsmallest(10))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1