'Calculating the mix of values within a pandas groupby
Within a database I have fields 'DATE', 'country', 'itemID', and 'price'. What I'm trying to do is calculate the mix of items (which sum to 100%) across price bin by date group by country.
The following code calculates the number of objects by country/price bin/date (which is aggregated to weekly frequency) and I'm happy with the results.
bins = [0, 5000, 10000, 20000, 30000, 40000, float('inf')]
labels = ['<5k', '10k', '20k', '30k', '40k', '>40k']
df.groupby(['country', pd.Grouper(key='DATE', freq='1W'), pd.cut(df['price'], bins=bins, labels=labels, include_lowest=True)]).agg({'itemID': 'nunique'}).unstack('DATE')
What I'd like to do is, instead of showing me the number of items in each category, I get the percentage in each price bin for that country on that date.
So, for example:
| DATE | 2022-01-30 | 2022-02-06 | |
|---|---|---|---|
| country | price | ||
| -------- | -------- | ||
| USA | <5k | 0% | 1% |
| USA | 10k | 5% | 19% |
| USA | 20k | 55% | 25% |
| USA | 30k | 30% | 55% |
| USA | 40k | 2% | 0% |
| USA | >40k | 8% | 0% |
| JAP | <5k | 0% | 1% |
| JAP | 10k | 25% | 0% |
| JAP | 20k | 45% | 15% |
| JAP | 30k | 20% | 35% |
| JAP | 40k | 2% | 25% |
| JAP | >40k | 8% | 25% |
I tried the below to initially calculate the number of unique itemIDs by country and date and then accessing this in the second part:
def f(d):
return d['itemID'] / d['nuniques']
df['nuniques'] = df.groupby(['country', pd.Grouper(key='DATE', freq='1W')])['itemID'].transform('nunique')
df.groupby(['country', pd.Grouper(key='DATE', freq='1W'), pd.cut(df['price'], bins=bins, labels=labels, include_lowest=True)]).agg({'itemID': 'nunique', 'nuniques': 'min'}).apply(lambda x: f(x), axis=1).unstack('AS_OF_DATE')
Excluding the cut function, it shows 100% across each country (as expected).
However, adding the cut function causes the sum of the items across the bins to exceed 100% (by <1%, but different amounts for each date).
Why would this be, and how can it be calculated more properly? At a guess, the discrepancy is something to do with the fact that not every itemID has an entry for every date within the weekly window (e.g. if it goes out of stock).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
