How to delete multiple values inside a given pandas DataFrame when groupby and transform are used?

I was calculating the mean, min, and standard deviation of azimuth and distance. The sample data (df2) is below; Sim_Az varies from -55 to 55 with a step size of 5.

Sim_Az Sim_dist RUT_Azi RUT_Dist Mean_dist Min_dist Mean_azi Min_azi
-55 3.38 -55.85 4.38 4.34 4.33 -55.91 -55.99
-55 3.38 -55.86 4.37 4.34 4.33 -55.91 -55.99
-55 3.38 -55.84 4.37 4.34 4.33 -55.91 -55.99
-55 3.38 -55.84 4.37 4.34 4.33 -55.91 -55.99
-55 3.38 -55.95 4.35 4.34 4.33 -55.91 -55.99
-55 3.38 -55.86 4.38 4.34 4.33 -55.91 -55.99
-55 3.38 -55.98 4.35 4.34 4.33 -55.91 -55.99
-55 3.38 -55.84 4.37 4.34 4.33 -55.91 -55.99
-55 3.38 -55.96 4.35 4.34 4.33 -55.91 -55.99
-55 7.38 -55.86 7.45 7.43 7.41 -55.91 -55.99
-55 7.38 -55.87 7.46 7.43 7.41 -55.91 -55.99

The expected output is:

Sim_Az Sim_dist RUT_Azi RUT_Dist Mean_dist Min_dist Mean_azi Min_azi count percent missed
-55 3.38 -55.85 4.38 4.34 4.33 -55.91 -55.99 9 91
-55 3.38 -55.86 4.37 7.43 7.41 -55.91 -55.99 2 98
-55 3.38 -55.84 4.37
-55 3.38 -55.84 4.37
-55 3.38 -55.95 4.35
-55 3.38 -55.86 4.38
-55 3.38 -55.98 4.35
-55 3.38 -55.84 4.37
-55 3.38 -55.96 4.35
-55 7.38 -55.86 7.45
-55 7.38 -55.87 7.46

The code I used to generate the sample data:

df2['Mean_dist'] = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Dist'].transform('mean')
df2['Min_dist'] = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Dist'].transform('min')
df2['Mean_azi'] = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Azi'].transform('mean')
df2['Min_azi'] = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Azi'].transform('min')

Try 1: When I tried to use aggregate instead, I got an error (TypeError: incompatible index of inserted column with frame index):

df2['Mean_dist'] = df2.groupby(['Sim_Az','Sim_dist'])['RUT_Dist'].agg('mean')
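The TypeError happens because agg returns one value per group, indexed by the group keys, while transform returns one value per original row, aligned with the frame's index. A minimal sketch on a toy frame (the column names here are illustrative, not the asker's data):

```python
import pandas as pd

# Toy frame with two groups.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 3.0, 10.0]})

g = df.groupby("key")["val"]

# transform: one value per original row, aligned with df's index,
# so it can be assigned directly as a new column.
df["mean_val"] = g.transform("mean")

# agg: one value per *group*, indexed by the keys ("a", "b"), which does
# not align with df's 0..2 index -- assigning it directly raises the
# "incompatible index" TypeError. Map it back through the key instead:
df["mean_via_agg"] = df["key"].map(g.agg("mean"))

print(df)
```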

How do I avoid the duplicate entries from Mean_dist to Min_azi and get the expected output shown above?
How can I add the count and percent missed columns to the sample data? The count is the number of rows sharing the same Sim_dist, and the percent is 100 - count.



Solution 1:[1]

As you mentioned in the comment before, drop_duplicates() removed those rows entirely, which was not what you wanted. Instead you can use mask() (or where()):

df = pd.read_csv("Data.csv")
df
#    Sim_Az  Sim_dist  RUT_Azi  RUT_Dist  Mean_dist  Min_dist  Mean_azi  Min_azi
#0      -55      3.38   -55.85      4.38       4.34      4.33    -55.91   -55.99
#1      -55      3.38   -55.86      4.37       4.34      4.33    -55.91   -55.99
#2      -55      3.38   -55.84      4.37       4.34      4.33    -55.91   -55.99
#3      -55      3.38   -55.84      4.37       4.34      4.33    -55.91   -55.99
#4      -55      3.38   -55.95      4.35       4.34      4.33    -55.91   -55.99
#5      -55      3.38   -55.86      4.38       4.34      4.33    -55.91   -55.99
#6      -55      3.38   -55.98      4.35       4.34      4.33    -55.91   -55.99
#7      -55      3.38   -55.84      4.37       4.34      4.33    -55.91   -55.99
#8      -55      3.38   -55.96      4.35       4.34      4.33    -55.91   -55.99
#9      -55      7.38   -55.86      7.45       7.43      7.41    -55.91   -55.99
#10     -55      7.38   -55.87      7.46       7.43      7.41    -55.91   -55.99

Calculations:

df["count"] = df.groupby("Sim_dist")["Sim_dist"].transform("count")
df["percent_missed"] = 100 - df["count"]
df[list(df.columns[-6:])] = df[list(df.columns[-6:])].mask(df[list(df.columns[-6:])].duplicated(),"")
df
#    Sim_Az  Sim_dist  RUT_Azi  RUT_Dist Mean_dist Min_dist Mean_azi Min_azi count percent_missed
#0      -55      3.38   -55.85      4.38      4.34     4.33   -55.91  -55.99     9             91
#1      -55      3.38   -55.86      4.37                                                         
#2      -55      3.38   -55.84      4.37                                                         
#3      -55      3.38   -55.84      4.37                                                         
#4      -55      3.38   -55.95      4.35                                                         
#5      -55      3.38   -55.86      4.38                                                         
#6      -55      3.38   -55.98      4.35                                                         
#7      -55      3.38   -55.84      4.37                                                         
#8      -55      3.38   -55.96      4.35                                                         
#9      -55      7.38   -55.86      7.45      7.43     7.41   -55.91  -55.99     2             98
#10     -55      7.38   -55.87      7.46                                                         

This can be broken down to make it clearer:

  1. The count and percent_missed columns are created first (their values repeat within each group, so they are also duplicates and will be blanked too).
  2. list(df.columns[-6:]) is the list of columns that contain the duplicates (the last 6 columns, including count and percent_missed).
  3. df[list(df.columns[-6:])] returns the dataframe restricted to those columns.
  4. df[list(df.columns[-6:])].duplicated() returns a Boolean Series indicating, per row, whether all of those columns duplicate an earlier row (every single column must match).
  5. df.mask(condition, other) replaces values with other wherever the condition is True.
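Steps 4 and 5 can be seen on a toy frame (column names here are illustrative, not the asker's):

```python
import pandas as pd

# Tiny frame mimicking the structure: a per-row column plus
# group-level summary columns that repeat within each group.
df = pd.DataFrame({
    "raw":  [4.38, 4.37, 7.45],
    "mean": [4.34, 4.34, 7.43],
    "min":  [4.33, 4.33, 7.41],
})

summary_cols = ["mean", "min"]
# duplicated() marks a row True when the summary columns match an
# earlier row; mask() then blanks those repeats.
dup = df[summary_cols].duplicated()
df[summary_cols] = df[summary_cols].mask(dup, "")
print(df)
```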

I realise this output is slightly different from your expected output, but it keeps the other columns aligned. I'm not sure it makes sense for the second retained summary row (7.43/7.41, from the 7.38 group) to sit next to a row whose Sim_dist is 3.38, as in your expected output.

If, however, that is the layout you want, you can use drop_duplicates() on a copy of the dataframe, then merge the first columns (:-4) of the original dataframe with the last columns (-6:) of the copy:

df2 = df.copy()
df2["count"] = df2.groupby("Sim_dist")["Sim_dist"].transform("count")
df2["percent_missed"] = 100 - df2["count"]
df2 = df2.iloc[:, -6:].drop_duplicates().reset_index()

df3 = df.iloc[:, :-4].merge(df2, left_index=True, right_index=True, how="outer").drop(columns=["index"]).fillna("")
df3
#    Sim_Az  Sim_dist  RUT_Azi  RUT_Dist Mean_dist Min_dist Mean_azi Min_azi count percent_missed
#0      -55      3.38   -55.85      4.38      4.34     4.33   -55.91  -55.99   9.0           91.0
#1      -55      3.38   -55.86      4.37      7.43     7.41   -55.91  -55.99   2.0           98.0
#2      -55      3.38   -55.84      4.37                                                         
#3      -55      3.38   -55.84      4.37                                                         
#4      -55      3.38   -55.95      4.35                                                         
#5      -55      3.38   -55.86      4.38                                                         
#6      -55      3.38   -55.98      4.35                                                         
#7      -55      3.38   -55.84      4.37                                                         
#8      -55      3.38   -55.96      4.35                                                         
#9      -55      7.38   -55.86      7.45                                                         
#10     -55      7.38   -55.87      7.46                                                         
  • This assumes your index is unnamed and has been reset; if not, use df.reset_index().iloc[:, :-4] in the last line.
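As a related sketch (not from the original answer), the same positional layout can be built with named aggregation instead of transform plus drop_duplicates: compute one summary row per group, then join it by position so the k-th summary row sits next to the k-th raw row. Column names mirror the question's, but the data is a toy example:

```python
import pandas as pd

df = pd.DataFrame({
    "Sim_dist": [3.38, 3.38, 7.38],
    "RUT_Dist": [4.38, 4.37, 7.45],
})

# One row per group, with named output columns.
summary = (df.groupby("Sim_dist", as_index=False)
             .agg(Mean_dist=("RUT_Dist", "mean"),
                  count=("RUT_Dist", "size")))
summary["percent_missed"] = 100 - summary["count"]

# Join on the positional index: summary row k lands beside raw row k;
# rows beyond the number of groups get blanks.
out = df.join(summary.drop(columns="Sim_dist")).fillna("")
print(out)
```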

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rawson