Dropping a DataFrame's rows inside an iterrows() loop doesn't work

I have a pandas DataFrame df1 and I want to create a new df2 whose columns are built from df1's row values:

   DAT_RUN     DAT_FORECAST  LIB_SOURCE  TEMPERATURE_PREDITE  NUM_DEVICE_ID_VISIOGREEN  DIFF_HOURS  TEMPERATURE_OBSERVEE  MonthNumber  origine
0  2020-08-11  2020-08-11    arome_001   22.365373            20002748                  0.0         22.51                 8            Visiogreen
1  2020-08-11  2020-08-11    arpege_01   23.280028            20002748                  0.0         22.51                 8            Visiogreen

For example, from the first two rows above, I want a single row with new columns, as shown in this image:

[image: the expected single row, with the predicted temperatures spread into new T_AROME / T_ARPEGE / T_GFS columns]

Here is my code:

1) I iterate over each row of the df with the iterrows() function.
2) For each row, I look for identical rows based on column conditions.
3) Then I drop those rows:
    
print("START")

new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen = sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.copy()

final_df = pd.DataFrame()

for index1, row1 in new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.iterrows():
    
    print(f"""We manage : {row1['DAT_RUN']}, {row1['DAT_FORECAST']}, {row1['LIB_SOURCE']}, {row1['NUM_DEVICE_ID_VISIOGREEN']}, {row1['DIFF_HOURS']}, {row1['TEMPERATURE_OBSERVEE']}, {row1['TEMPERATURE_PREDITE']}, {row1['origine']}""")
    
    d = {'DAT_RUN'              : row1['DAT_RUN'],
         'DAT_FORECAST'         : row1['DAT_FORECAST'],
         'DIFF_HOURS'           : row1['DIFF_HOURS'],
         'TEMPERATURE_OBSERVEE' : row1['TEMPERATURE_OBSERVEE'],
         'MonthNumber'          : row1['MonthNumber']
        }
    
    df_temp = new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen[(new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['DAT_RUN'] == row1['DAT_RUN']) 
                                                                   & (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['DAT_FORECAST'] == row1['DAT_FORECAST'])
                                                                   & (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['NUM_DEVICE_ID_VISIOGREEN'] == row1['NUM_DEVICE_ID_VISIOGREEN'])
                                                                   & (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['DIFF_HOURS'] == row1['DIFF_HOURS'])
                                                                   & (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['TEMPERATURE_OBSERVEE'] == row1['TEMPERATURE_OBSERVEE'])
                                                                   & (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['origine'] == row1['origine'])]
    # Collect the indexes of the matching rows
    index_names = df_temp.index
    
    #print(f"""Il y a {df_temp.shape[0]} enregistrements identiques""")
    
    for index2, row2 in df_temp.iterrows():
        print(f"""Which have identicals: {row2['DAT_RUN']}, {row2['DAT_FORECAST']}, {row2['DIFF_HOURS']}, {row2['LIB_SOURCE']}, {row2['NUM_DEVICE_ID_VISIOGREEN']},  {row2['TEMPERATURE_OBSERVEE']},  {row2['TEMPERATURE_PREDITE']}, {row2['origine']}""")
            
        if row2['LIB_SOURCE'] == 'arome_001':
            coldfname = 'T_AROME'
        elif row2['LIB_SOURCE'] == 'arpege_01':
            coldfname = 'T_ARPEGE'
        elif row2['LIB_SOURCE'] == 'gfs_025':
            coldfname = 'T_GFS'
        else:
            print("Warning: there is no arome, arpege or GFS forecast in this block")
        d.update([(coldfname, row1['TEMPERATURE_PREDITE'])])
        
    final_df = final_df.append(d, ignore_index=True)
    
    # Drop the processed rows
    print(f"""We drop {len(index_names)} identical records""")
    
    #new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.drop(index1, inplace = True)
    new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.drop(index_names, inplace = True)
  
print("END")  

Output:

START
We manage: 2020-08-11 00:00:00, 2020-08-11 00:00:00, arome_001, 20002748, 0.0, 22.509999999999998, 22.3653728369278, Visiogreen
Identical row found: 2020-08-11 00:00:00, 2020-08-11 00:00:00, 0.0, arome_001, 20002748, 22.509999999999998, 22.3653728369278, Visiogreen
Identical row found: 2020-08-11 00:00:00, 2020-08-11 00:00:00, 0.0, arpege_01, 20002748, 22.509999999999998, 23.2800282137495, Visiogreen
We drop 2 identical records
We manage: 2020-08-11 00:00:00, 2020-08-11 00:00:00, arpege_01, 20002748, 0.0, 22.509999999999998, 23.2800282137495, Visiogreen
We drop 0 identical records
END

I have two problems with final_df:

   DAT_RUN     DAT_FORECAST  DIFF_HOURS  TEMPERATURE_OBSERVEE  MonthNumber  T_AROME    T_ARPEGE
0  2020-08-11  2020-08-11    0.0         22.51                 8.0          22.365373  22.365373
1  2020-08-11  2020-08-11    0.0         22.51                 8.0          NaN        NaN
  1. a second line is created: I think iterrows() does not correctly handle rows that are dropped inside the loop!?

  2. and the forecast value is the same in the two columns

Théo.



Solution 1:[1]

a second line is created: I think iterrows() does not correctly handle rows that are dropped inside the loop!?

Generally speaking, trying to modify an index (or any structure) while iterating over it is risky: the structure changes from iteration to iteration, and references don't stay consistent. Your code has two nested loops and three DataFrames, which makes it all the more complicated.
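
Here is a minimal sketch of the pitfall with a toy DataFrame (not your data): iterrows() iterates over a snapshot of the frame's index and values taken when the loop starts, so rows you drop with inplace=True are still yielded later.

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b']})

for idx, row in df.iterrows():
    print('visiting index', idx)  # prints 0, 1 AND 2: every original row is visited
    # drop the current row together with its duplicates, as your loop does
    duplicates = df[df['group'] == row['group']].index
    df.drop(duplicates, inplace=True)

Even though indexes 0 and 1 are both dropped on the first pass, the loop still visits index 1 afterwards; that is exactly why a second, mostly-NaN line shows up in your final_df.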

I'd suggest changing your approach instead and taking advantage of the grouping capabilities of DataFrames.

import pandas as pd
import numpy as np

def sum_and_keep_nans(s: pd.Series):
    """
    Aggregation function that returns NaN if ALL values in the Series are NaN, 
    or the sum of the Series values otherwise.
    """
    if s.isna().all():
        return np.nan
    return s.sum()

# From your code, I understand you want to keep the original df, so we make a copy
final_df = sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.copy()

# Now let's create the new columns. For each, we use "LIB_SOURCE" as a filter to 
# report 'TEMPERATURE_PREDITE', so values will be either that or NaN
final_df['T_AROME'] = final_df['TEMPERATURE_PREDITE'][final_df['LIB_SOURCE'] == 'arome_001']
final_df['T_ARPEGE'] = final_df['TEMPERATURE_PREDITE'][final_df['LIB_SOURCE'] == 'arpege_01']
final_df['T_GFS'] = final_df['TEMPERATURE_PREDITE'][final_df['LIB_SOURCE'] == 'gfs_025']

# Let's drop those, we don't need them anymore
final_df.drop(['TEMPERATURE_PREDITE', 'LIB_SOURCE'], axis=1, inplace=True)

# Now we have 3 columns, each holding either NaN or 'TEMPERATURE_PREDITE' as a value;
# we're going to sum them for each set of identical values in the other columns
columns_to_sum = ['T_AROME', 'T_ARPEGE', 'T_GFS']
# we define the grouping to apply the summation as all non-summed columns
columns_to_group = list(final_df.columns.drop(columns_to_sum))

# finally, we group by date, device..., use our custom aggregation function, and 
# we reset the index to keep the df columns
final_df = final_df.groupby(columns_to_group)[columns_to_sum].agg(sum_and_keep_nans).reset_index()

Output:

      DAT_RUN DAT_FORECAST  NUM_DEVICE_ID_VISIOGREEN  DIFF_HOURS  TEMPERATURE_OBSERVEE  MonthNumber     origine    T_AROME   T_ARPEGE  T_GFS
0  2020-08-11   2020-08-11                  20002748         0.0                 22.51            8  Visiogreen  22.365373  23.280028    NaN
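
For reference, the same reshape can also be written with pivot_table, which turns LIB_SOURCE into columns in one step. This is only a sketch, assuming each group has at most one row per source, so the default mean aggregation simply returns that single value:

import pandas as pd

# map each source to the desired column name
source_to_col = {'arome_001': 'T_AROME', 'arpege_01': 'T_ARPEGE', 'gfs_025': 'T_GFS'}

final_df = (
    sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen
    .assign(LIB_SOURCE=lambda d: d['LIB_SOURCE'].map(source_to_col))
    .pivot_table(index=['DAT_RUN', 'DAT_FORECAST', 'NUM_DEVICE_ID_VISIOGREEN',
                        'DIFF_HOURS', 'TEMPERATURE_OBSERVEE', 'MonthNumber', 'origine'],
                 columns='LIB_SOURCE',
                 values='TEMPERATURE_PREDITE')
    .reset_index()
    .rename_axis(columns=None)  # drop the leftover 'LIB_SOURCE' columns name
)

Note that pivot_table silently drops columns whose entries are all NaN, so a source with no rows at all (like gfs_025 in the sample) won't appear; reindex the columns afterwards if you need it present.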

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
