'Dropping df 's rows inside an iterrows() function does'nt work
I have a pandas dataframe df1 and I want to create a new df2 with columns created from df1 's rows values :
DAT_RUN DAT_FORECAST LIB_SOURCE TEMPERATURE_PREDITE NUM_DEVICE_ID_VISIOGREEN DIFF_HOURS TEMPERATURE_OBSERVEE MonthNumber origine
0 2020-08-11 2020-08-11 arome_001 22.365373 20002748 0.0 22.51 8 Visiogreen
1 2020-08-11 2020-08-11 arpege_01 23.280028 20002748 0.0 22.51 8 Visiogreen
For example, for the first above 2 rows, I want only 1 row with news columns as:
Here is my code:
1) I iterate on each row of the df with iterrows() function:
2) For each row, I'm looking for identicals rows based on columns conditions
3) After, I drop rows:
print("START")
new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen = sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.copy()
final_df = pd.DataFrame()
for index1, row1 in new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.iterrows():
print(f"""We manage : {row1['DAT_RUN']}, {row1['DAT_FORECAST']}, {row1['LIB_SOURCE']}, {row1['NUM_DEVICE_ID_VISIOGREEN']}, {row1['DIFF_HOURS']}, {row1['TEMPERATURE_OBSERVEE']}, {row1['TEMPERATURE_PREDITE']}, {row1['origine']}""")
d = {'DAT_RUN' : row1['DAT_RUN'],
'DAT_FORECAST' : row1['DAT_FORECAST'],
'DIFF_HOURS' : row1['DIFF_HOURS'],
'TEMPERATURE_OBSERVEE' : row1['TEMPERATURE_OBSERVEE'],
'MonthNumber' : row1['MonthNumber']
}
df_temp = new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen[(new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['DAT_RUN'] == row1['DAT_RUN'])
& (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['DAT_FORECAST'] == row1['DAT_FORECAST'])
& (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['NUM_DEVICE_ID_VISIOGREEN'] == row1['NUM_DEVICE_ID_VISIOGREEN'])
& (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['DIFF_HOURS'] == row1['DIFF_HOURS'])
& (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['TEMPERATURE_OBSERVEE'] == row1['TEMPERATURE_OBSERVEE'])
& (new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen['origine'] == row1['origine'])]
# Récupération des index
index_names = df_temp.index
#print(f"""Il y a {df_temp.shape[0]} enregistrements identiques""")
for index2, row2 in df_temp.iterrows():
print(f"""Which have identicals: {row2['DAT_RUN']}, {row2['DAT_FORECAST']}, {row2['DIFF_HOURS']}, {row2['LIB_SOURCE']}, {row2['NUM_DEVICE_ID_VISIOGREEN']}, {row2['TEMPERATURE_OBSERVEE']}, {row2['TEMPERATURE_PREDITE']}, {row2['origine']}""")
if row2['LIB_SOURCE'] == 'arome_001':
coldfname = 'T_AROME'
else:
if row2['LIB_SOURCE'] == 'arpege_01':
coldfname = 'T_ARPEGE'
else:
if row2['LIB_SOURCE'] == 'gfs_025':
coldfname = 'T_GFS'
else:
print("Attention: il n'y a pas dans ce bloc de prévisions arome, arpege ou GFS")
d.update([(coldfname, row1['TEMPERATURE_PREDITE'])])
final_df = final_df.append(d, ignore_index=True)
# Suppression des lignes traitées
print(f"""We drop {len(index_names)} enregistrements identiques""")
#new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.drop(index1, inplace = True)
new_sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.drop(index_names, inplace = True)
print("END")
Output:
START
We manage : 2020-08-11 00:00:00, 2020-08-11 00:00:00, arome_001, 20002748, 0.0, 22.509999999999998, 22.3653728369278, Visiogreen
Which have identicals: 2020-08-11 00:00:00, 2020-08-11 00:00:00, 0.0, arome_001, 20002748, 22.509999999999998, 22.3653728369278, Visiogreen
Which have identicals: 2020-08-11 00:00:00, 2020-08-11 00:00:00, 0.0, arpege_01, 20002748, 22.509999999999998, 23.2800282137495, Visiogreen
We drop 2 enregistrements identiques
We manage : 2020-08-11 00:00:00, 2020-08-11 00:00:00, arpege_01, 20002748, 0.0, 22.509999999999998, 23.2800282137495, Visiogreen
We drop 0 enregistrements identiques
END
I have 2 problems in final_df:
DAT_RUN DAT_FORECAST DIFF_HOURS TEMPERATURE_OBSERVEE MonthNumber T_AROME T_ARPEGE
0 2020-08-11 2020-08-11 0.0 22.51 8.0 22.365373 22.365373
1 2020-08-11 2020-08-11 0.0 22.51 8.0 NaN NaN
a second line is created : I think that iterrows() does not well managed rows dropped inside loop !?
and forecast value is the same in the 2 columns
Théo.
Solution 1:[1]
a second line is created : I think that iterrows() does not well managed rows dropped inside loop !?
Generally speaking, trying to modify an index (or any structure) in an iterative way is a risky exercise, as the structure will change from iteration to iteration, and references won't stay consistent. Now you have two nested loops in your code, and three DataFrames, which makes it all the more complicated.
I'd rather suggest to change your approach and that you take advantage of the grouping possibilities of DataFrames.
import numpy as np
def sum_and_keep_nans(s: pd.Series):
"""
Aggregation function that returns NaN if ALL values in the Series are NaN,
or the sum of the Series values otherwise.
"""
if s.isna().all():
return np.nan
return s.sum()
# From your code, I understand you want to keep the original df, so we make a copy
final_df = sorted_fpm_dsczd_sencrop_fpm_dsczd_fact_visiogreen.copy()
# Now let's create the new columns. For each, we use "LIB_SOURCE" as a filter to
# report 'TEMPERATURE_PREDITE', so values will be either that or NaN
final_df['T_AROME'] = final_df['TEMPERATURE_PREDITE'][final_df['LIB_SOURCE'] == 'arome_001']
final_df['T_ARPEGE'] = final_df['TEMPERATURE_PREDITE'][final_df['LIB_SOURCE'] == 'arpege_01']
final_df['T_GFS'] = final_df['TEMPERATURE_PREDITE'][final_df['LIB_SOURCE'] == 'gfs_025']
# Let's drop those, we don't need them anymore
final_df.drop(['TEMPERATURE_PREDITE', 'LIB_SOURCE'], axis=1, inplace=True)
# Now we have 3 columns, with either zeros or 'TEMPERATURE_PREDITE' as a value,
# we're going to sum them for each set of identical values in the other columns
columns_to_sum = ['T_AROME', 'T_ARPEGE', 'T_GFS']
# we define the grouping to apply the summation as all non-summed columns
columns_to_group = list(final_df.columns.drop(columns_to_sum))
# finally, we group by date, device..., use our custom aggregation function, and
# we reset the index to keep the df columns
final_df = final_df.groupby(columns_to_group)[columns_to_sum].agg(sum_and_keep_nans).reset_index()
Ouput:
DAT_RUN DAT_FORECAST NUM_DEVICE_ID_VISIOGREEN DIFF_HOURS TEMPERATURE_OBSERVEE MonthNumber origine T_AROME T_ARPEGE T_GFS
0 2020-08-11 2020-08-11 20002748 0.0 22.51 8 Visiogreen 22.365373 23.280028 NaN
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |