'Pandas Groupby: Count and mean combined

Working with pandas to try and summarise a data frame as a count of certain categories, as well as the means sentiment score for these categories.

There is a table full of strings that have different sentiment scores, and I want to group each text source by saying how many posts they have, as well as the average sentiment of these posts.

My (simplified) data frame looks like this:

source    text              sent
--------------------------------
bar       some string       0.13
foo       alt string        -0.8
bar       another str       0.7
foo       some text         -0.2
foo       more text         -0.5

The output from this should be something like this:

source    count     mean_sent
-----------------------------
foo       3         -0.5
bar       2         0.415

The answer is somewhere along the lines of:

df['sent'].groupby(df['source']).mean()

Yet only gives each source and it's mean, with no column headers.



Solution 1:[1]

In newer versions of pandas you don't need the rename anymore, just use named aggregation:

df = df.groupby('source') \
       .agg(count=('text', 'size'), mean_sent=('sent', 'mean')) \
       .reset_index()

print (df)
  source  count  mean_sent
0    bar      2      0.415
1    foo      3     -0.500

Solution 2:[2]

Below one should work fine:

df[['source','sent']].groupby('source').agg(['count','mean'])

Solution 3:[3]

For those who were looking for aggregations for more than two columns (as I were): just add those to 'agg'.

df = df.groupby(['id']).agg({'texts': 'size', 'char_num': 'mean', 'bytes': 'mean'}).reset_index()

Solution 4:[4]

A shorter version to achieve this is:

df.groupby('source')['sent'].agg(count='size', mean_sent='mean').reset_index()

The nice thing about this is that you can extend it if you want to take the mean of multiple variables but only count once. In this case you will have to pass a dictionary:

df.groupby('source')['sent1', 'sent2'].agg({'count': 'size', 'means': 'mean'}).reset_index()

Solution 5:[5]

I think this should provide the output that you wanted:

result = pd.DataFrame(df.groupby('source').size())

results['mean_score'] =  df.groupby('source').sent.mean()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Henry Ecker
Solution 2 h4z3
Solution 3 João
Solution 4 Henry Ecker
Solution 5 Henry Ecker