How to delete multiple values inside a given pandas DataFrame when groupby and transform are used?

I was calculating the mean, min, and standard deviation of azimuth and distance. The sample data (df2) is below; Sim_Az varies from -55 to 55 with a step size of 5.

Sim_Az Sim_dist RUT_Azi RUT_Dist Mean_dist Min_dist Mean_azi Min_azi
-55 3.38 -55.85 4.38 4.34 4.33 -55.91 -55.99
-55 3.38 -55.86 4.37 4.34 4.33 -55.91 -55.99
-55 3.38 -55.84 4.37 4.34 4.33 -55.91 -55.99
-55 3.38 -55.84 4.37 4.34 4.33 -55.91 -55.99
-55 3.38 -55.95 4.35 4.34 4.33 -55.91 -55.99
-55 3.38 -55.86 4.38 4.34 4.33 -55.91 -55.99
-55 3.38 -55.98 4.35 4.34 4.33 -55.91 -55.99
-55 3.38 -55.84 4.37 4.34 4.33 -55.91 -55.99
-55 3.38 -55.96 4.35 4.34 4.33 -55.91 -55.99
-55 7.38 -55.86 7.45 7.43 7.41 -55.91 -55.99
-55 7.38 -55.87 7.46 7.43 7.41 -55.91 -55.99

The expected output is:

Sim_Az Sim_dist RUT_Azi RUT_Dist Mean_dist Min_dist Mean_azi Min_azi count percent missed
-55 3.38 -55.85 4.38 4.34 4.33 -55.91 -55.99 9 91
-55 3.38 -55.86 4.37 7.43 7.41 -55.91 -55.99 2 98
-55 3.38 -55.84 4.37
-55 3.38 -55.84 4.37
-55 3.38 -55.95 4.35
-55 3.38 -55.86 4.38
-55 3.38 -55.98 4.35
-55 3.38 -55.84 4.37
-55 3.38 -55.96 4.35
-55 7.38 -55.86 7.45
-55 7.38 -55.87 7.46

The code I used to generate the sample data:

df2['Mean_dist'] = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Dist'].transform('mean')
df2['Min_dist'] = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Dist'].transform('min')
df2['Mean_azi'] = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Azi'].transform('mean')
df2['Min_azi'] = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Azi'].transform('min')

Try 1: When I tried to use aggregate instead, I got an error (TypeError: incompatible index of inserted column with frame index):

df2['Mean_dist'] = df2.groupby(['Sim_Az','Sim_dist'])['RUT_Dist'].agg('mean')
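The TypeError happens because agg returns one value per group, indexed by the group keys, while transform returns one value per original row, aligned with the frame's index. A minimal sketch on a toy frame (the column names here are illustrative, not the asker's data):

```python
import pandas as pd

# Toy frame with two groups.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 3.0, 10.0]})

g = df.groupby("key")["val"]

# transform: one value per original row, aligned with df's index,
# so it can be assigned directly as a new column.
df["mean_val"] = g.transform("mean")

# agg: one value per *group*, indexed by the keys ("a", "b"), which does
# not align with df's 0..2 index -- assigning it directly raises the
# "incompatible index" TypeError. Map it back through the key instead:
df["mean_via_agg"] = df["key"].map(g.agg("mean"))

print(df)
```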

How do I avoid the duplicate entries from Mean_dist to Min_azi and get the expected output shown above?
How can I add the count and percent missed columns to the sample data? The count is the number of rows sharing the same Sim_dist, and the percent is 100 - count.



Solution 1:[1]

As you mentioned in the comment before, drop_duplicates() removed those rows entirely, which was not what you wanted. Instead you can use mask() (or where()):

df = pd.read_csv("Data.csv")
df
#    Sim_Az  Sim_dist  RUT_Azi  RUT_Dist  Mean_dist  Min_dist  Mean_azi  Min_azi
#0      -55      3.38   -55.85      4.38       4.34      4.33    -55.91   -55.99
#1      -55      3.38   -55.86      4.37       4.34      4.33    -55.91   -55.99
#2      -55      3.38   -55.84      4.37       4.34      4.33    -55.91   -55.99
#3      -55      3.38   -55.84      4.37       4.34      4.33    -55.91   -55.99
#4      -55      3.38   -55.95      4.35       4.34      4.33    -55.91   -55.99
#5      -55      3.38   -55.86      4.38       4.34      4.33    -55.91   -55.99
#6      -55      3.38   -55.98      4.35       4.34      4.33    -55.91   -55.99
#7      -55      3.38   -55.84      4.37       4.34      4.33    -55.91   -55.99
#8      -55      3.38   -55.96      4.35       4.34      4.33    -55.91   -55.99
#9      -55      7.38   -55.86      7.45       7.43      7.41    -55.91   -55.99
#10     -55      7.38   -55.87      7.46       7.43      7.41    -55.91   -55.99

Calculations:

df["count"] = df.groupby("Sim_dist")["Sim_dist"].transform("count")
df["percent_missed"] = 100 - df["count"]
df[list(df.columns[-6:])] = df[list(df.columns[-6:])].mask(df[list(df.columns[-6:])].duplicated(),"")
df
#    Sim_Az  Sim_dist  RUT_Azi  RUT_Dist Mean_dist Min_dist Mean_azi Min_azi count percent_missed
#0      -55      3.38   -55.85      4.38      4.34     4.33   -55.91  -55.99     9             91
#1      -55      3.38   -55.86      4.37                                                         
#2      -55      3.38   -55.84      4.37                                                         
#3      -55      3.38   -55.84      4.37                                                         
#4      -55      3.38   -55.95      4.35                                                         
#5      -55      3.38   -55.86      4.38                                                         
#6      -55      3.38   -55.98      4.35                                                         
#7      -55      3.38   -55.84      4.37                                                         
#8      -55      3.38   -55.96      4.35                                                         
#9      -55      7.38   -55.86      7.45      7.43     7.41   -55.91  -55.99     2             98
#10     -55      7.38   -55.87      7.46                                                         

This can be broken down to make it clearer:

  1. The count and percent_missed columns are created first (their values repeat within each group, so they are also duplicates and will be blanked too).
  2. list(df.columns[-6:]) is the list of columns that contain the duplicates (the last 6 columns, including count and percent_missed).
  3. df[list(df.columns[-6:])] returns the dataframe restricted to those columns.
  4. df[list(df.columns[-6:])].duplicated() returns a Boolean Series indicating, per row, whether all of those columns duplicate an earlier row (every single column must match).
  5. df.mask(condition, other) replaces values with other wherever the condition is True.
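Steps 4 and 5 can be seen on a toy frame (column names here are illustrative, not the asker's):

```python
import pandas as pd

# Tiny frame mimicking the structure: a per-row column plus
# group-level summary columns that repeat within each group.
df = pd.DataFrame({
    "raw":  [4.38, 4.37, 7.45],
    "mean": [4.34, 4.34, 7.43],
    "min":  [4.33, 4.33, 7.41],
})

summary_cols = ["mean", "min"]
# duplicated() marks a row True when the summary columns match an
# earlier row; mask() then blanks those repeats.
dup = df[summary_cols].duplicated()
df[summary_cols] = df[summary_cols].mask(dup, "")
print(df)
```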

I realise this output is slightly different from your expected output, but it keeps the other columns aligned. I'm not sure it makes sense for the second retained summary row (7.43/7.41, from the 7.38 group) to sit next to a row whose Sim_dist is 3.38, as in your expected output.

If, however, that is the layout you want, you can use drop_duplicates() on a copy of the dataframe, then merge the first columns (:-4) of the original dataframe with the last columns (-6:) of the copy:

df2 = df.copy()
df2["count"] = df2.groupby("Sim_dist")["Sim_dist"].transform("count")
df2["percent_missed"] = 100 - df2["count"]
df2 = df2.iloc[:, -6:].drop_duplicates().reset_index()

df3 = df.iloc[:, :-4].merge(df2, left_index=True, right_index=True, how="outer").drop(columns=["index"]).fillna("")
df3
#    Sim_Az  Sim_dist  RUT_Azi  RUT_Dist Mean_dist Min_dist Mean_azi Min_azi count percent_missed
#0      -55      3.38   -55.85      4.38      4.34     4.33   -55.91  -55.99   9.0           91.0
#1      -55      3.38   -55.86      4.37      7.43     7.41   -55.91  -55.99   2.0           98.0
#2      -55      3.38   -55.84      4.37                                                         
#3      -55      3.38   -55.84      4.37                                                         
#4      -55      3.38   -55.95      4.35                                                         
#5      -55      3.38   -55.86      4.38                                                         
#6      -55      3.38   -55.98      4.35                                                         
#7      -55      3.38   -55.84      4.37                                                         
#8      -55      3.38   -55.96      4.35                                                         
#9      -55      7.38   -55.86      7.45                                                         
#10     -55      7.38   -55.87      7.46                                                         
  • This assumes your index is unnamed and has been reset; if not, use df.reset_index().iloc[:, :-4] in the last line.
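As a related sketch (not from the original answer), the same positional layout can be built with named aggregation instead of transform plus drop_duplicates: compute one summary row per group, then join it by position so the k-th summary row sits next to the k-th raw row. Column names mirror the question's, but the data is a toy example:

```python
import pandas as pd

df = pd.DataFrame({
    "Sim_dist": [3.38, 3.38, 7.38],
    "RUT_Dist": [4.38, 4.37, 7.45],
})

# One row per group, with named output columns.
summary = (df.groupby("Sim_dist", as_index=False)
             .agg(Mean_dist=("RUT_Dist", "mean"),
                  count=("RUT_Dist", "size")))
summary["percent_missed"] = 100 - summary["count"]

# Join on the positional index: summary row k lands beside raw row k;
# rows beyond the number of groups get blanks.
out = df.join(summary.drop(columns="Sim_dist")).fillna("")
print(out)
```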

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rawson