How to delete multiple values inside a given pandas dataframe when groupby and transform are used?
I was calculating the mean, min, and standard deviation of azimuth and distance. The sample data (df2) is below:
Sim_Az varies from -55 to 55 with a step size of 5.
| Sim_Az | Sim_dist | RUT_Azi | RUT_Dist | Mean_dist | Min_dist | Mean_azi | Min_azi |
|---|---|---|---|---|---|---|---|
| -55 | 3.38 | -55.85 | 4.38 | 4.34 | 4.33 | -55.91 | -55.99 |
| -55 | 3.38 | -55.86 | 4.37 | 4.34 | 4.33 | -55.91 | -55.99 |
| -55 | 3.38 | -55.84 | 4.37 | 4.34 | 4.33 | -55.91 | -55.99 |
| -55 | 3.38 | -55.84 | 4.37 | 4.34 | 4.33 | -55.91 | -55.99 |
| -55 | 3.38 | -55.95 | 4.35 | 4.34 | 4.33 | -55.91 | -55.99 |
| -55 | 3.38 | -55.86 | 4.38 | 4.34 | 4.33 | -55.91 | -55.99 |
| -55 | 3.38 | -55.98 | 4.35 | 4.34 | 4.33 | -55.91 | -55.99 |
| -55 | 3.38 | -55.84 | 4.37 | 4.34 | 4.33 | -55.91 | -55.99 |
| -55 | 3.38 | -55.96 | 4.35 | 4.34 | 4.33 | -55.91 | -55.99 |
| -55 | 7.38 | -55.86 | 7.45 | 7.43 | 7.41 | -55.91 | -55.99 |
| -55 | 7.38 | -55.87 | 7.46 | 7.43 | 7.41 | -55.91 | -55.99 |
The expected output is:
| Sim_Az | Sim_dist | RUT_Azi | RUT_Dist | Mean_dist | Min_dist | Mean_azi | Min_azi | count | percent missed |
|---|---|---|---|---|---|---|---|---|---|
| -55 | 3.38 | -55.85 | 4.38 | 4.34 | 4.33 | -55.91 | -55.99 | 9 | 91 |
| -55 | 3.38 | -55.86 | 4.37 | 7.43 | 7.41 | -55.91 | -55.99 | 2 | 98 |
| -55 | 3.38 | -55.84 | 4.37 | | | | | | |
| -55 | 3.38 | -55.84 | 4.37 | | | | | | |
| -55 | 3.38 | -55.95 | 4.35 | | | | | | |
| -55 | 3.38 | -55.86 | 4.38 | | | | | | |
| -55 | 3.38 | -55.98 | 4.35 | | | | | | |
| -55 | 3.38 | -55.84 | 4.37 | | | | | | |
| -55 | 3.38 | -55.96 | 4.35 | | | | | | |
| -55 | 7.38 | -55.86 | 7.45 | | | | | | |
| -55 | 7.38 | -55.87 | 7.46 | | | | | | |
The code I used to get the sample data:
df2['Mean_dist'] = df2.groupby(['Sim_Az','Sim_dist'])['RUT_Dist'].transform('mean')
df2['Min_dist'] = df2.groupby(['Sim_Az','Sim_dist'])['RUT_Dist'].transform('min')
df2['Mean_azi'] = df2.groupby(['Sim_Az','Sim_dist'])['RUT_Azi'].transform('mean')
df2['Min_azi'] = df2.groupby(['Sim_Az','Sim_dist'])['RUT_Azi'].transform('min')
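For reference, the four transform calls above can be written more compactly by reusing a single groupby object. This is only a sketch that assumes df2 already contains the Sim_Az, Sim_dist, RUT_Azi, and RUT_Dist columns; the grouped and derived names are purely illustrative, and the result matches the lines above.

```python
# Sketch: same four derived columns, one groupby object reused for all of them.
# Assumes df2 already has Sim_Az, Sim_dist, RUT_Azi and RUT_Dist columns.
grouped = df2.groupby(['Sim_Az', 'Sim_dist'])
derived = {
    'Mean_dist': ('RUT_Dist', 'mean'),
    'Min_dist':  ('RUT_Dist', 'min'),
    'Mean_azi':  ('RUT_Azi',  'mean'),
    'Min_azi':   ('RUT_Azi',  'min'),
}
for new_col, (src_col, func) in derived.items():
    # transform broadcasts the per-group statistic back to every row
    df2[new_col] = grouped[src_col].transform(func)
```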
Try 1: When I tried to use aggregate, I got an error (TypeError: incompatible index of inserted column with frame index):
df2['Mean_dist'] = df2.groupby(['Sim_Az','Sim_dist'])['RUT_Dist'].agg('mean')
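The TypeError comes from the shape of the agg result: agg collapses each (Sim_Az, Sim_dist) group to a single value, so the returned Series is indexed by the group keys rather than by df2's row labels and cannot be inserted back as a column. A minimal sketch of the difference (the per_group and Mean_dist_agg names are only for illustration):

```python
# agg: one value per group, indexed by the group keys -> cannot be assigned
# directly as a column of df2, whose index is the original row labels.
per_group = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Dist'].agg('mean')
print(per_group.index)  # MultiIndex of (Sim_Az, Sim_dist), one entry per group

# transform: one value per original row, so the assignment aligns cleanly.
df2['Mean_dist'] = df2.groupby(['Sim_Az', 'Sim_dist'])['RUT_Dist'].transform('mean')

# If the aggregated (one-row-per-group) result itself is wanted,
# merge it back on the group keys instead of assigning it.
df2 = df2.merge(
    per_group.rename('Mean_dist_agg').reset_index(),
    on=['Sim_Az', 'Sim_dist'],
    how='left',
)
```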
How can I avoid the duplicate entries from Mean_dist to Min_azi and get the expected output shown above?
How can I add the count and percent missed columns to the sample data?
The count is the number of rows that share the same Sim_dist value, and the percent missed is 100 - count.
Solution 1:[1]
As you mentioned in the comments, drop_duplicates() removed those rows, which was not what you wanted. Instead you can use mask() (or where()):
import pandas as pd

df = pd.read_csv("Data.csv")
df
# Sim_Az Sim_dist RUT_Azi RUT_Dist Mean_dist Min_dist Mean_azi Min_azi
#0 -55 3.38 -55.85 4.38 4.34 4.33 -55.91 -55.99
#1 -55 3.38 -55.86 4.37 4.34 4.33 -55.91 -55.99
#2 -55 3.38 -55.84 4.37 4.34 4.33 -55.91 -55.99
#3 -55 3.38 -55.84 4.37 4.34 4.33 -55.91 -55.99
#4 -55 3.38 -55.95 4.35 4.34 4.33 -55.91 -55.99
#5 -55 3.38 -55.86 4.38 4.34 4.33 -55.91 -55.99
#6 -55 3.38 -55.98 4.35 4.34 4.33 -55.91 -55.99
#7 -55 3.38 -55.84 4.37 4.34 4.33 -55.91 -55.99
#8 -55 3.38 -55.96 4.35 4.34 4.33 -55.91 -55.99
#9 -55 7.38 -55.86 7.45 7.43 7.41 -55.91 -55.99
#10 -55 7.38 -55.87 7.46 7.43 7.41 -55.91 -55.99
Calculations:
df["count"] = df.groupby("Sim_dist")["Sim_dist"].transform("count")
df["percent_missed"] = 100 - df["count"]
df[list(df.columns[-6:])] = df[list(df.columns[-6:])].mask(df[list(df.columns[-6:])].duplicated(), "")
df
# Sim_Az Sim_dist RUT_Azi RUT_Dist Mean_dist Min_dist Mean_azi Min_azi count percent_missed
#0 -55 3.38 -55.85 4.38 4.34 4.33 -55.91 -55.99 9 91
#1 -55 3.38 -55.86 4.37
#2 -55 3.38 -55.84 4.37
#3 -55 3.38 -55.84 4.37
#4 -55 3.38 -55.95 4.35
#5 -55 3.38 -55.86 4.38
#6 -55 3.38 -55.98 4.35
#7 -55 3.38 -55.84 4.37
#8 -55 3.38 -55.96 4.35
#9 -55 7.38 -55.86 7.45 7.43 7.41 -55.91 -55.99 2 98
#10 -55 7.38 -55.87 7.46
This can be broken down to make it clearer:
- The count and percent_missed columns are created first, so that in the duplicated rows these values are also duplicates and get removed along with the rest.
- list(df.columns[-6:]) is the list of the columns that contain the duplicates (the last 6 columns, including count and percent_missed).
- df[list(df.columns[-6:])] returns the dataframe with those columns only.
- df[list(df.columns[-6:])].duplicated() returns a Boolean series indicating whether all of those columns are a duplicate of an earlier row (so every single column must be the same).
- df.mask(condition, other) replaces values with other wherever the condition is True.
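Putting the breakdown together, the masking line can also be written with named intermediates; this is just a sketch, and the cols and dup_rows names are only illustrative.

```python
# Equivalent to the one-liner above, step by step.
cols = list(df.columns[-6:])        # Mean_dist ... percent_missed
dup_rows = df[cols].duplicated()    # True where all six values repeat an earlier row
df[cols] = df[cols].mask(dup_rows, "")
```

df[cols].where(~dup_rows, "") would give the same result, since where keeps values where the condition is True while mask replaces them.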
I realise the output is slightly different, but this is so that the other columns also align. I'm not sure it makes sense to change Sim_dist from 7.38 to 3.38 for the second row of not-deleted duplicates.
If, however, this is how you want to do it, you can use drop_duplicates() on a copy of the dataframe, then merge the first columns (:-4) of the first dataframe with the last columns (-6:) of the second:
df2 = df.copy()
df2["count"] = df2.groupby("Sim_dist")["Sim_dist"].transform("count")
df2["percent_missed"] = 100 - df2["count"]
df2 = df2.iloc[:, -6:].drop_duplicates().reset_index()
df3 = df.iloc[:, :-4].merge(df2, left_index=True, right_index=True, how="outer").drop(columns=["index"]).fillna("")
df3
# Sim_Az Sim_dist RUT_Azi RUT_Dist Mean_dist Min_dist Mean_azi Min_azi count percent_missed
#0 -55 3.38 -55.85 4.38 4.34 4.33 -55.91 -55.99 9.0 91.0
#1 -55 3.38 -55.86 4.37 7.43 7.41 -55.91 -55.99 2.0 98.0
#2 -55 3.38 -55.84 4.37
#3 -55 3.38 -55.84 4.37
#4 -55 3.38 -55.95 4.35
#5 -55 3.38 -55.86 4.38
#6 -55 3.38 -55.98 4.35
#7 -55 3.38 -55.84 4.37
#8 -55 3.38 -55.96 4.35
#9 -55 7.38 -55.86 7.45
#10 -55 7.38 -55.87 7.46
- This assumes you have not named your index at all (and that it has been reset); if not, use df.reset_index().iloc[:, :-4] in the last line. A sketch of that case follows.
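One possible sketch of the non-default-index case mentioned in the note above, assuming the old index labels are not needed afterwards, is to reset the left frame's index with drop=True so the positional merge still lines up (the left name is only illustrative):

```python
# Sketch for a named/non-default index: reset it (dropping the old labels)
# so both frames align on a plain RangeIndex before the positional merge.
left = df.reset_index(drop=True).iloc[:, :-4]
df3 = (
    left.merge(df2, left_index=True, right_index=True, how="outer")
        .drop(columns=["index"])   # helper column left over from df2.reset_index()
        .fillna("")
)
```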
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rawson |
