Splitting from a group using pandas
I have a data frame called train with about 9,800 rows and the following columns:
business_code cust_number name_customer clear_date buisness_year
0 U001 0200769623 WAL-MAR corp 2020-02-11 2020.0
1 U001 0200980828 BEN E 2019-08-08 2019.0
2 U001 0200792734 MDV/ trust 2019-12-30 2019.0
4 U001 0200769623 WAL-MAR foundation 2019-11-25 2019.0
5 CA02 0140106181 THE corporation 2019-12-04 2019.0
doc_id posting_date due_in_date baseline_create_date
0 1.930438e+09 2020-01-26 2020-02-10 2020-01-26
1 1.929646e+09 2019-07-22 2019-08-11 2019-07-22
2 1.929874e+09 2019-09-14 2019-09-29 2019-09-14
4 1.930148e+09 2019-11-13 2019-11-28 2019-11-13
5 2.960581e+09 2019-09-20 2019-10-04 2019-09-24
cust_payment_terms converted_usd
0 NAH4 54273.28
1 NAD1 79656.6
2 NAA8 2253.86
4 NAH4 33133.29
5 CA10 15558.088
We used groupby in pandas to compute the mean delay per customer:
dt = train.groupby('name_customer')['delay'].mean(numeric_only=False)
Printing dt gives output like this:
name_customer
11078 us 17.0
17135 associates -10.0
17135 llc -3.0
236008 associates -3.0
99 CE 2.0
...
YEN BROS corp 0.0
YEN BROS corporation -0.5
YEN BROS llc -2.0
ZARCO co -1.0
ZIYAD us 6.0
Name: delay, Length: 3889, dtype: float64
Is there any way to take the mean delay values in the second column of dt and add them to the train dataset as a new column? I am fairly new to CSV files and data frames, so apologies if this is a basic question.
Solution 1:[1]
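import numpy as np
# Create a placeholder column with one entry per row of df
df['avg_delay'] = np.arange(df.shape[0])
# Mean delay per customer; the result is indexed by name_customer
a = df.groupby('name_customer')['Delay'].mean(numeric_only=False)
# Direct assignment aligns on index labels, not on name_customer
df.avg_delay = a
df.avg_delay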
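avg_delay has around 47000 rows, while a has 4152 rows (one per customer).
In the output we see NaN (null) values towards the start and end, with our values in between. This happens because assigning a to the column aligns it with df by index label rather than by name_customer, so any row of df whose index label does not appear in a is filled with NaN.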
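One way to avoid the index mismatch described above is to broadcast each group's mean back onto its rows with transform, or to look the means up by customer name with map. The following is a minimal sketch, not the original poster's code: the small train frame and its values are made up for illustration, and the column names avg_delay and avg_delay_mapped are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the train frame; values are made up for illustration.
train = pd.DataFrame({
    "name_customer": ["WAL-MAR corp", "BEN E", "WAL-MAR corp", "BEN E", "MDV/ trust"],
    "delay": [1.0, -3.0, 3.0, 5.0, 0.0],
})

# transform("mean") returns one value per original row, aligned to train's index,
# so it can be assigned directly as a new column.
train["avg_delay"] = train.groupby("name_customer")["delay"].transform("mean")

# Alternative route: compute the per-customer means once (as in dt),
# then look each row's customer name up in that Series.
dt = train.groupby("name_customer")["delay"].mean()
train["avg_delay_mapped"] = train["name_customer"].map(dt)

print(train)
```

Both columns should hold the same values. transform is the shorter option when the grouped means are not needed separately, while map is convenient when a Series such as dt has already been computed.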
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | General Grievance |
