'pandas dataframe groupby sum returns wrong answer

I am trying to groupby a dataframe with 4000+ rows by 'summing' a column called 'From Email Address'.

sum_df = df.groupby(['From Email Address' ], as_index=False).agg({'Name':'first', 'From Email Address':'first',
                                                            'Country':'first', 'Subject':'first','Gross': 'sum'}) #grouped by dataframe'
  

An 'From email address' called '[email protected]' has 200+ entries(rows) in the file. The value returned for that particular row is peculiar:

-10-10-10-49-10-8-30-5.8-79-90-10-10-49-10-468-7-90-10-79-5-10-6.66-1,500.00-49-10-10-8-49-4.99-10-10-10-2-4.74-49-49-90-7.5-35-10-90-7.5-7.5-17.15-10-9-42.5-10-10-468-49-33.2-49-10-15.8-10-3.22-49-5-140.4-6.13-10-7.75-10-51.35-10-49-90-3.22-90-10-5-2.9-6-79-74-6.66-49-468-10-10-49-8.38-10-2.6-149-250-10-31.6-10-49-7-10-10-5-36.5-66.2-49-3-35.7-10-10-10-5-49-5-10-49-10-10-79-406.38-29.5-10-49-49-90-10-10-10-49-50-4.7-25.48-10-3.22-10-10-10-3.22-1.6-49-49-49-49-49

whereas all other rows outputs correct value. Can anyone please give insight about how to solve this? Thanks!

I tried:

sum_df2 = df.groupby('From Email Address', as_index=False)['Gross'].sum()

The same value returned for this too.



Solution 1:[1]

It appears that the numbers are being stored as strings. For example, there are fragments like 1,500.00.

Because of the way the + operator acts on strings (concatenation), the summation is concatenating a bunch of them.

The solution is to make sure you first convert your data to numeric types.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Paddy Alton