Is dropna=True in pandas groupby useful?
I am not certain if this question is appropriate here, and apologies in advance if it is not.
I am a pandas maintainer, and recently I've been working on fixing bugs in pandas groupby when used with dropna=True and transform for the 1.5 release. For example, in pandas 1.4.2,
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})
# The last row has a null group key, so with dropna=True it belongs to no group
print(df.groupby('a', dropna=True).transform('sum'))
produces incorrect output (the last row in particular is wrong):
b
0 5
1 5
2 5
While working on this, I've been wondering how useful the dropna argument to groupby really is. For aggregations (e.g. df.groupby('a').sum()) and filters (e.g. df.groupby('a').head(2)), it seems to me that it's always possible to drop the rows with null group keys before calling groupby. Beyond that, in my own use of pandas, if there are null values in the groupers then I want them in the groupby result. For transformations, where the index of the result should match that of the input, rows with a null group key can't be dropped, so their values are filled with null instead. For the code block above, the output should be
b
0 5.0
1 5.0
2 NaN
But I can't imagine this result ever being useful. In case it is, it's also not difficult to produce without dropna=True:
result = df.groupby('a', dropna=False).transform('sum')
result.loc[df['a'].isnull()] = np.nan
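As a self-contained variation on the two lines above, here is a minimal sketch that masks the rows with a null key using where instead of assigning through .loc. The name masked is only illustrative, and the choice of where is an assumption made to avoid assigning NaN into an integer column in place; it is not something pandas requires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

# Keep the null key as its own group so every input row gets a value.
result = df.groupby('a', dropna=False).transform('sum')

# Blank out the rows whose group key is null; where() returns a new
# frame and upcasts the column to float so it can hold NaN.
masked = result.where(df['a'].notna())

print(masked)
```

The first two rows keep the group sum and the last row becomes NaN, which matches the output I'd expect from dropna=True.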
If we were able to deprecate and then remove the dropna argument to groupby (i.e. groupby always behaves as if dropna=False), then this would help simplify a good part of the groupby code.
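For the aggregation case, a rough sketch of what users would write if the argument went away: drop the rows with null keys first, then group. The names with_dropna and pre_dropped are just illustrative, and this only checks a single key column with sum, so treat it as an illustration rather than a proof that the two always agree.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

# Current behaviour: rows whose key is null are left out of the result.
with_dropna = df.groupby('a', dropna=True).sum()

# Without relying on the argument: remove the null keys before grouping.
pre_dropped = df.dropna(subset=['a']).groupby('a', dropna=False).sum()

# Compare the two results.
print(with_dropna.equals(pre_dropped))
```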
So I'd like to ask: are there examples where dropna=True is useful and the same operation would otherwise be hard to accomplish?
Thanks!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
