Pandas groupby() different output with versions 0.23.4 and 1.3.4
I have 2 codebases with the same code; the only difference is the version of pandas being used:
- OLD environment uses pandas version 0.23.4
- NEW environment uses pandas version 1.3.4
I have traced the difference to this line of code, after which the results diverge:
result = df.groupby(group_items, as_index=as_index, sort=sort)[sum_items].sum()
Variables df, group_items, as_index, sort and sum_items are all exactly the same between both NEW and OLD environments.
However, the returned result is a little bit different in the NEW version. Specifically, the output looks like this:
NEW environment:
df.groupby(group_items, as_index=as_index, sort=sort)[sum_items].sum()
SST_ADJ_TYPE SST_ADJ_RULE ... NCI AMOUNT
0 0 SST22a,SST22b, ... 1874757.0
1 0 SST22a,SST22b, ... 5945263.0
2 0 SST22a,SST22b, ... 4303110.0
3 0 SST22a,SST22b, ... 5342991.0
4 0 SST22a,SST22b, ... 9245478.0
... ... ... ... .. ...
133674 3 SST22b,SST07, ... 4164305.0
133675 3 SST22b,SST07, ... 7280203.0
133676 3 SST22b,SST07, ... 1235752.0
133677 3 SST22b,SST07, ... 3115825.0
133678 3 SST22b,SST07, ... 1436891.0
[133679 rows x 16 columns]
OLD environment:
df.groupby(group_items, as_index=as_index, sort=sort)[sum_items].sum()
SST_ADJ_TYPE SST_ADJ_RULE ... NCI AMOUNT
0 0 SST22a,SST22b, ... 1874757.0
1 0 SST22a,SST22b, ... 5945263.0
2 0 SST22a,SST22b, ... 4303110.0
3 0 SST22a,SST22b, ... 5342991.0
4 0 SST22a,SST22b, ... 9245478.0
5 0 SST22a,SST22b, ... 4016202.0
6 0 SST22a,SST22b, ... 8799969.0
7 0 SST22a,SST22b, ... 1503269.0
8 0 SST22a,SST22b, ... 6385991.0
9 0 SST22a,SST22b, ... 1686520.0
10 0 SST22a,SST22b, ... 5287114.0
11 0 SST22a,SST22b, ... 2648534.0
12 0 SST22a,SST22b, ... 6159017.0
13 0 SST22a,SST22b, ... 5959591.0
14 0 SST22a,SST22b, ... 5809998.0
15 0 SST22a,SST22b, ... 4929077.0
16 0 SST22a,SST22b, ... 9166004.0
17 0 SST22a,SST22b, ... 2124498.0
18 0 SST22a,SST22b, ... 3051659.0
19 0 SST22a,SST22b, ... 1859001.0
20 0 SST22a,SST22b, ... 8522834.0
21 0 SST22a,SST22b, ... 7803526.0
22 0 SST22a,SST22b, ... 4067546.0
23 0 SST22a,SST22b, ... 9218486.0
24 0 SST22a,SST22b, ... 1453153.0
25 0 SST22a,SST22b, ... 7411706.0
26 0 SST22a,SST22b, ... 9160444.0
27 0 SST22a,SST22b, ... 6255426.0
28 0 SST22a,SST22b, ... 6007841.0
29 0 SST22a,SST22b, ... 4744588.0
... ... ... ... .. ...
133649 3 SST22b,SST07, ... 6487572.0
133650 3 SST22b,SST07, ... 3593805.0
133651 3 SST22b,SST07, ... 9192954.0
133652 3 SST22b,SST07, ... 2394981.0
133653 3 SST22b,SST07, ... 9398971.0
133654 3 SST22b,SST07, ... 5536294.0
133655 3 SST22b,SST07, ... 8759613.0
133656 3 SST22b,SST07, ... 2012212.0
133657 3 SST22b,SST07, ... 7930551.0
133658 3 SST22b,SST07, ... 3407871.0
133659 3 SST22b,SST07, ... 3071541.0
133660 3 SST22b,SST07, ... 1863129.0
133661 3 SST22b,SST07, ... 8439646.0
133662 3 SST22b,SST07, ... 1518097.0
133663 3 SST22b,SST07, ... 7396702.0
133664 3 SST22b,SST07, ... 8470274.0
133665 3 SST22b,SST07, ... 8363095.0
133666 3 SST22b,SST07, ... 1115614.0
133667 3 SST22b,SST07, ... 6317772.0
133668 3 SST22b,SST07, ... 2645613.0
133669 3 SST22b,SST07, ... 6555039.0
133670 3 SST22b,SST07, ... 5274987.0
133671 3 SST22b,SST07, ... 5779789.0
133672 3 SST22b,SST07, ... 6974948.0
133673 3 SST22b,SST07, ... 6370779.0
133674 3 SST22b,SST07, ... 4164305.0
133675 3 SST22b,SST07, ... 7280203.0
133676 3 SST22b,SST07, ... 1235752.0
133677 3 SST22b,SST07, ... 1436891.0
133678 3 SST22b,SST07, ... 3115825.0
[133679 rows x 16 columns]
As you can see, the number of rows and columns is the same.
The columns are also exactly the same between the two results.
However, when you check the AMOUNT column, you can see that the rows come out in a different order in the NEW environment. For example, the last two rows (133677 and 133678) are swapped relative to the OLD environment.
Any idea why this is happening?
PS: Unfortunately, I cannot provide a DataFrame that you can load, since the DataFrame I'm using contains a lot of data. I'm looking for a more theoretical answer: what changed between the two pandas versions mentioned above, and/or which argument I should use in the NEW environment to get exactly the same result as in the OLD environment.
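For illustration, here is a minimal sketch of one way to check whether the two results differ only in row order. It does not explain what changed between the versions. The helper name `canonical` and the toy data are hypothetical, and it assumes the columns in group_items uniquely identify each row of the result (which holds for a groupby sum, since there is one row per group).

```python
import pandas as pd


def canonical(result, key_cols):
    """Return the frame sorted by the grouping keys with a fresh 0..n-1 index.

    Hypothetical helper: it makes the row order depend only on the key
    values, not on the order in which groupby happened to emit the groups.
    """
    return result.sort_values(key_cols).reset_index(drop=True)


# Toy stand-in for the real DataFrame (values are made up for illustration).
df = pd.DataFrame({
    "SST_ADJ_TYPE": [3, 0, 3, 0],
    "SST_ADJ_RULE": ["SST22b,SST07,", "SST22a,SST22b,",
                     "SST22b,SST07,", "SST22a,SST22b,"],
    "AMOUNT": [1.0, 2.0, 3.0, 4.0],
})
group_items = ["SST_ADJ_TYPE", "SST_ADJ_RULE"]
sum_items = ["AMOUNT"]

result = df.groupby(group_items, as_index=False, sort=False)[sum_items].sum()

# Simulate "the same rows in a different order", as seen between environments.
reordered = result.iloc[::-1]

# After canonicalising, the two frames compare equal if only the order differed.
pd.testing.assert_frame_equal(
    canonical(result, group_items),
    canonical(reordered, group_items),
)
```

If the canonicalised results from both environments match, the sums themselves agree and only the ordering differs. In that case, sorting explicitly by the group columns after the groupby is one way to get the same row order in both environments.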
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
