Splitting from a group using pandas

I have a data frame called train with about 9,800 rows and the following columns:

  business_code cust_number       name_customer clear_date  buisness_year  
0          U001  0200769623        WAL-MAR corp 2020-02-11         2020.0   
1          U001  0200980828              BEN E  2019-08-08         2019.0   
2          U001  0200792734          MDV/ trust 2019-12-30         2019.0   
4          U001  0200769623  WAL-MAR foundation 2019-11-25         2019.0   
5          CA02  0140106181    THE  corporation 2019-12-04         2019.0   

         doc_id posting_date due_in_date baseline_create_date  
0  1.930438e+09   2020-01-26  2020-02-10           2020-01-26   
1  1.929646e+09   2019-07-22  2019-08-11           2019-07-22   
2  1.929874e+09   2019-09-14  2019-09-29           2019-09-14   
4  1.930148e+09   2019-11-13  2019-11-28           2019-11-13   
5  2.960581e+09   2019-09-20  2019-10-04           2019-09-24   

  cust_payment_terms converted_usd  
0               NAH4      54273.28  
1               NAD1       79656.6  
2               NAA8       2253.86  
4               NAH4      33133.29  
5               CA10     15558.088   

We used groupby in pandas to compute the mean delay per customer:

dt=train.groupby('name_customer')['delay'].mean(numeric_only=False)

When we print dt, the output looks like this:

name_customer
11078 us                17.0
17135 associates       -10.0
17135 llc               -3.0
236008 associates       -3.0
99 CE                    2.0
                        ... 
YEN BROS corp            0.0
YEN BROS corporation    -0.5
YEN BROS llc            -2.0
ZARCO co                -1.0
ZIYAD  us                6.0
Name: delay, Length: 3889, dtype: float64
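
To make the shape of that result concrete, here is a minimal sketch on made-up data. The column names mirror the question, but the values are invented for illustration. It shows that the groupby result is a Series indexed by name_customer, so a single customer's mean can be read by label, and that reset_index turns it into an ordinary two-column data frame (the column name avg_delay is an assumption, not something from the original post).

```python
import pandas as pd

# Toy data: column names follow the question, values are made up.
train = pd.DataFrame({
    "name_customer": ["YEN BROS corp", "YEN BROS llc", "YEN BROS llc", "ZARCO co"],
    "delay": [0.0, -3.0, -1.0, -1.0],
})

# One mean per customer; the result is a Series indexed by name_customer.
dt = train.groupby("name_customer")["delay"].mean()

# Read one customer's mean by its index label.
yen_llc_mean = dt.loc["YEN BROS llc"]

# Turn the Series into a two-column DataFrame (name_customer, avg_delay).
dt_frame = dt.reset_index(name="avg_delay")
print(dt_frame)
```

Because dt has one entry per customer rather than one per row, it cannot simply be pasted next to train without matching on the customer name.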

Is there a way to take the per-customer mean delay from dt and add it to the train dataset? I am fairly new to CSV files and dataframes, so sorry if this sounds stupid.



Solution 1:[1]

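import numpy as np

# Placeholder column holding 0..n-1, one value per row
df['avg_delay']=np.arange(df.shape[0])
# Mean delay per customer; the result is indexed by name_customer
a=df.groupby('name_customer')['Delay'].mean(numeric_only=False)

# Overwrite the placeholder with a (pandas aligns the values on index labels)
df.avg_delay=a

df.avg_delay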

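avg_delay has around 47000 rows, while a has 4152 rows (one per customer).

In the output we see NaN (null) values toward the start and end, with our values in between. This is most likely because assigning a Series to a column aligns on index labels: a is indexed by name_customer, while df has its own row index, so any row whose index label does not appear in a gets NaN.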
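
One way to get a per-row average without the NaN values is sketched below, on made-up data. It assumes the goal is to give every row the mean Delay of its own customer. groupby(...).transform("mean") returns a result with the same length and index as the original frame, so it can be assigned directly. Mapping the customer name through the grouped means gives the same values. The column name avg_delay_mapped is introduced here only for comparison.

```python
import pandas as pd

# Toy data: column names follow the answer above, values are made up.
df = pd.DataFrame({
    "name_customer": ["A corp", "B llc", "A corp", "C co", "B llc"],
    "Delay": [2.0, -1.0, 4.0, 0.0, 3.0],
})

# transform("mean") returns one value per original row, aligned with df's index.
df["avg_delay"] = df.groupby("name_customer")["Delay"].transform("mean")

# Alternative: compute the per-customer means, then look each row up by name.
a = df.groupby("name_customer")["Delay"].mean()
df["avg_delay_mapped"] = df["name_customer"].map(a)

print(df)
```

transform is the shorter option when the means are only needed as a new column. The map variant is handy when the grouped Series (dt in the question) has already been computed and should be reused.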

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 General Grievance