Slow groupby in Pandas
I have a dataframe, an example of which is below. In all, the dataframe is 928,571 rows. I am using Pandas groupby to calculate the mean SMM by sector, seasoned, and incentive using the code below. On a small dataset of 200 rows it works fine. However, when aggregating the full dataframe, the calculation has been running for over 18 hours and still has not completed. I have tried several possible solutions I found on SO, but none has helped. Any ideas would be appreciated, as otherwise I will have to turn to a database solution using SQL from Python, which seems like overkill for this problem. I feel there is a 'trick' here that I am missing.
ModelData = ModelViewData.groupby(
    ['sector', 'seasoned', 'incentive'],
    as_index=False
)[['smm']].agg('mean')
data = {
    'Sector': ['LLB', 'MLB', 'HLB'],
    'Seasoned': ['new', 'moderate', 'seasoned'],
    'RPB': [80000, 125000, 175000],
    'Incentive': [.25, .25, .25],
    'SMM': [.006, .007, .008]
}
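For reference, here is a minimal runnable sketch of the same aggregation on just the sample rows. It assumes the column names are lowercased to match the groupby call above (the sample dict uses 'Sector', 'SMM', etc., which would otherwise raise a KeyError); that renaming is an assumption, not part of the original question.

import pandas as pd

# Sample data with lowercased column names so they match the groupby call.
sample = pd.DataFrame({
    'sector': ['LLB', 'MLB', 'HLB'],
    'seasoned': ['new', 'moderate', 'seasoned'],
    'rpb': [80000, 125000, 175000],
    'incentive': [.25, .25, .25],
    'smm': [.006, .007, .008]
})

# Mean SMM per (sector, seasoned, incentive) group, kept as regular columns.
result = sample.groupby(
    ['sector', 'seasoned', 'incentive'],
    as_index=False
)[['smm']].agg('mean')
print(result)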
Solution 1:[1]
I really don't see where this would be slow. I generated some random data to get a somewhat similar kind of scenario to your data:
import random
import pandas

# Three rows mirroring the sample data, followed by roughly one million
# rows of random sectors, seasoning buckets, balances, incentives and SMMs.
df = pandas.DataFrame({
    'sector': ['LLB', 'MLB', 'HLB'] + ["".join([random.choice("ABCDEXYZ") for _ in range(3)]) for _ in range(1000000)],
    'seasoned': ['new', 'moderate', 'seasoned'] + ["".join([random.choice("FGHIJXYZ") for _ in range(2)]) for _ in range(1000000)],
    'rpb': [80000, 125000, 175000] + [random.randint(100, 300000) for _ in range(1000000)],
    'incentive': [.25, .25, .25] + [random.choice([.0, .25, .5, .75]) for _ in range(1000000)],
    'smm': [.006, .007, .008] + [random.random() for _ in range(1000000)]
})
This results in around 1 million rows and 131,011 distinct groups.
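If you want to check the group count on your own data, a quick sketch using pandas' ngroups attribute on the grouped object looks like this:

# Number of distinct (sector, seasoned, incentive) combinations in df.
print(df.groupby(['sector', 'seasoned', 'incentive']).ngroups)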
Then calling this:
ModelData = df.groupby(['sector', 'seasoned', 'incentive'], as_index=False)[['smm']].agg('mean')
This still runs in less than one second for me. I also tried a smaller number of random choices (i.e., fewer distinct groups), and the timing is essentially the same.
Perhaps show the full path from the raw data to the aggregation, as the groupby itself might not actually be the spot causing the near-endless computation. When you see extremely long run times on large dataframes, that usually suggests an (at least) quadratic loop somewhere, which may be unintended (a classic mistake is, for example, calling apply for each row in a way that recomputes over the entire dataframe for every row).
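As a rough illustration of that pitfall (a hypothetical sketch, not code from the question), the row-wise apply below rescans the whole frame once per row to recompute its group's mean, while the vectorized groupby touches each row only once. It reuses the random df built above and is trimmed to 2,000 rows so the slow version finishes.

import time

small = df.head(2000)  # keep the deliberately slow version tolerable

# Anti-pattern: for every row, filter the whole frame and recompute the mean.
start = time.perf_counter()
slow = small.apply(
    lambda row: small.loc[
        (small['sector'] == row['sector'])
        & (small['seasoned'] == row['seasoned'])
        & (small['incentive'] == row['incentive']),
        'smm'
    ].mean(),
    axis=1,
)
print('row-wise apply:', time.perf_counter() - start, 'seconds')

# Vectorized groupby: one pass over the data.
start = time.perf_counter()
fast = small.groupby(
    ['sector', 'seasoned', 'incentive'],
    as_index=False
)[['smm']].agg('mean')
print('vectorized groupby:', time.perf_counter() - start, 'seconds')

Even at 2,000 rows the gap is large, and the apply version scales roughly quadratically, which is consistent with a groupby-sized job stretching into hours on ~900k rows.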
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ewz93 |
