'Vectorize a function for a GroupBy Pandas Dataframe

I have a Pandas dataframe sorted by a datetime column. Several rows will have the same datetime, but the "report type" column value is different. I need to select just one of those rows based on a list of preferred report types. The list is in order of preference. So, if one of those rows has the first element in the list, then that is the row chosen to be appended to a new dataframe.

I've tried a GroupBy and the ever so slow Python for loops to process each group to find the preferred report type and append that row to a new dataframe. I thought about the numpy vectorize(), but I don't know how to incorporate the group by in it. I really don't know much about dataframes but am learning. Any ideas on how to make it faster? Can I incorporate the group by?

The example dataframe

OBSERVATIONTIME       REPTYPE   CIGFT
2000-01-01 00:00:00 AUTO    73300
2000-01-01 00:00:00 FM-15   25000
2000-01-01 00:00:00 FM-12   3000
2000-01-01 01:00:00 SAO 9000
2000-01-01 01:00:00 FM-16   600
2000-01-01 01:00:00 FM-15   5000
2000-01-01 01:00:00 AUTO    5000
2000-01-01 02:00:00 FM-12   12000
2000-01-01 02:00:00 FM-15   15000
2000-01-01 02:00:00 FM-16   8000
2000-01-01 03:00:00 SAO 700
2000-01-01 04:00:00 SAO 3000
2000-01-01 05:00:00 FM-16   5000
2000-01-01 06:00:00 AUTO    15000
2000-01-01 06:00:00 FM-12   12500
2000-01-01 06:00:00 FM-16   12000
2000-01-01 07:00:00 FM-15   20000

#################################################
# The function to loop through and find the row
################################################
    def select_the_one_ob(df):
    ''' select the preferred observation '''
    tophour_df = pd.DataFrame()
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 
'SY-MT', 'SY-SA']
    grouped = df.groupby("OBSERVATIONTIME", as_index=False)

    for name, group in grouped:
        a_group_df = pd.DataFrame(grouped.get_group(name))

        for reptype in preferred_order:
            preferred_found = False
            for i in a_group_df.index.values:
                if a_group_df.loc[i, 'REPTYPE'] == reptype:
                    tophour_df = 
tophour_df.append(a_group_df.loc[i].transpose())
                    preferred_found = True
                    break

            if preferred_found:
                break

        del a_group_df



    return tophour_df

################################################
### The function which calls the above function
################################################
def process_ceiling(plat, network):
    platformcig.data_pull(CONNECT_SRC, PULL_CEILING)
    data_df = platformcig.df

    data_df = select_the_one_ob(data_df)

With the complete dataset of 300,000 rows, the function takes over 4 hours. I need it to be much faster. Can I incorporate the group by in numpy vectorize()?



Solution 1:[1]

Found that creating a separate dataframe of the same shape populated with each hour of the observation time, I could use use pandas dataframe merge() and after the first pass use pandas dataframe combine_first(). This took only minutes instead of hours.

def select_the_one_ob(df):
    ''' select the preferred observation
    Parameters: 
        df (Pandas Object), a Pandas dataframe
    
    Returns Pandas Dataframe
    '''
    dshelldict = {'DateTime': pd.date_range(BEG_POR, END_POR, freq='H')}
    dshell = pd.DataFrame(data = dshelldict)

    dshell['YEAR'] = dshell['DateTime'].dt.year  
    dshell['MONTH'] = dshell['DateTime'].dt.month
    dshell['DAY'] = dshell['DateTime'].dt.day
    dshell['HOUR'] = dshell['DateTime'].dt.hour
    dshell = dshell.set_index(['YEAR','MONTH','DAY','HOUR'])
    df = df.set_index(['YEAR','MONTH','DAY','HOUR'])

    #tophour_df = pd.DataFrame()
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    reptype_list = list(df.REPTYPE.unique())
    
    # remove the preferred report types from the unique ones
    for rep in preferred_order:
        if rep in reptype_list:
            reptype_list.remove(rep)
    
    # If there are any unique report types left, append them to the preferred list
    if len(reptype_list) > 0:
        preferred_order = preferred_order + reptype_list

    ## i is flag to make sure a report type is used to transfer columns to new DataFrame
    ## (Merge has to happen before combine first)
    first_pass = True
    for reptype in preferred_order:
        if first_pass:
            ## if there is data in dataframe
            if df[(df['MINUTE']==00)&(df['REPTYPE']==reptype)].shape[0]>0:
                first_pass = False
                # Merge shell with first df with data, the dataframe is sorted by original 
                # obstime and drop any dup's keeping first aka. first report chronologically
                tophour_df = dshell.merge( df[ (df['MINUTE']==00)&(df['REPTYPE']==reptype) ].sort_values(['OBSERVATIONTIME'],ascending=True).drop_duplicates(subset=['ROLLED_OBSERVATIONTIME'],keep='first'),how ='left',left_index = True,right_index=True ).drop('DateTime',axis=1)
        else:
            # combine_first takes the original dataframe and fills any nan values with data 
            # of another identical shape dataframe
            # ex. if value df.loc[2,col1] is nan df2.loc[2,col1] would fill it if not nan
            tophour_df = tophour_df.combine_first(df[(df['MINUTE']==00)&(df['REPTYPE']==reptype)].sort_values(['OBSERVATIONTIME'],ascending=True).drop_duplicates(subset=['ROLLED_OBSERVATIONTIME'],keep='first'))

    tophour_df = tophour_df.reset_index()

    return tophour_df

Solution 2:[2]

You can avoid to use groupby. One way could be to categorize your column 'REPTYPE' with pd.Categorical and then sort_values and drop_duplicates such as:

def select_the_one_ob(df):
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    df.REPTYPE = pd.Categorical(df.REPTYPE, categories=preferred_order, ordered=True)
    return (df.sort_values(by=['OBSERVATIONTIME','REPTYPE'])
              .drop_duplicates(subset='OBSERVATIONTIME', keep='first'))

and you get with your example:

       OBSERVATIONTIME REPTYPE  CIGFT
1  2000-01-01 00:00:00   FM-15  25000
5  2000-01-01 01:00:00   FM-15   5000
8  2000-01-01 02:00:00   FM-15  15000
10 2000-01-01 03:00:00     SAO    700
11 2000-01-01 04:00:00     SAO   3000
12 2000-01-01 05:00:00   FM-16   5000
13 2000-01-01 06:00:00    AUTO  15000
16 2000-01-01 07:00:00   FM-15  20000

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 E. Zeytinci
Solution 2 Ben.T