Pandas groupby with max() returns AssertionError

There seems to be something wrong with pandas, and I would like your opinion.

I have this DataFrame where I need to get the max values; the code is just below:

import pandas as pd

df_stack=pd.DataFrame([[1.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
        47.6122, -122.33799, 1927.0, 57.85220900338872,
        59.91269863912585],
       [1.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
        47.61317, -122.33393, 1996.0, 55.82342114189166,
        56.86951201265458],
       [3.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
        47.61393, -122.3381, 1969.0, 76.68191235628086,
        77.37931271575705],
       [5.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
        47.61412, -122.33664, 1926.0, 68.53505428597694,
        71.00764283155655],
       [8.0, 2016.0, 'NonResidential', 'Hotel', 98121.0, 'DOWNTOWN',
        47.61375, -122.34047, 1980.0, 67.01346098859122,
        68.34485815906346]], columns=['OSEBuildingID', 'DataYear', 'BuildingType', 'PrimaryPropertyType', 
 'ZipCode', 'Neighborhood', 'Latitude', 'Longitude', 'YearBuilt', 
 'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)' ])

When I run the code below:

df_stack[['OSEBuildingID', 
          'DataYear', 
          'BuildingType', 
          'PrimaryPropertyType', 
          'ZipCode', 'Neighborhood', 'Latitude', 'Longitude', 
          'YearBuilt', 'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)']].groupby('OSEBuildingID').max()

I get an error, "AssertionError:", the same one you'll probably get if you try this. But when I comment out these two columns and run the code again:

df_stack[['OSEBuildingID', 
          'DataYear', 
          #'BuildingType', 
          #'PrimaryPropertyType', 
          'ZipCode', 'Neighborhood', 'Latitude', 'Longitude', 
          'YearBuilt', 'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)']].groupby('OSEBuildingID').max()

I get the results:

               DataYear  ZipCode Neighborhood  Latitude  Longitude  YearBuilt  SourceEUI(KWm2)  SourceEUIWN(KWm2)
OSEBuildingID
1.0              2016.0  98101.0     DOWNTOWN  47.61317 -122.33393     1996.0        57.852209          59.912699
3.0              2016.0  98101.0     DOWNTOWN  47.61393 -122.33810     1969.0        76.681912          77.379313
5.0              2016.0  98101.0     DOWNTOWN  47.61412 -122.33664     1926.0        68.535054          71.007643
8.0              2016.0  98121.0     DOWNTOWN  47.61375 -122.34047     1980.0        67.013461          68.344858

If I replace max() with mean(), I can uncomment those two lines and run the code with no problem. This behaviour only happens with max() and min() (well, I only tested max, mean and min), but I need to get the max.

Thank you if you can help.



Solution 1:[1]

This was a regression in pandas 1.0.0 that was fixed in 1.0.1, so I suggest you upgrade your version.

Fixed regression in .groupby().agg() raising an AssertionError for some reductions like min on object-dtype columns (GH31522)
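
If you want to check quickly, something like this should tell you whether you are on the affected release (a minimal sketch; the pip command is just one way to upgrade, use conda or whatever you installed pandas with):

import pandas as pd

print(pd.__version__)   # 1.0.0 is the affected release for this regression

# then, from a shell:
# pip install --upgrade "pandas>=1.0.1"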

Solution 2:[2]

Carlos Carvalho, when I run this code, I don't get any errors. Can you confirm you still get the error if you copy and paste this into your terminal? As implied in a comment above, it might have to do with your version. Also, BuildingType and PrimaryPropertyType are objects and not floats, but it should still work:

import pandas as pd

df_stack=pd.DataFrame([[1.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
        47.6122, -122.33799, 1927.0, 57.85220900338872,
        59.91269863912585],
       [1.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
        47.61317, -122.33393, 1996.0, 55.82342114189166,
        56.86951201265458],
       [3.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
        47.61393, -122.3381, 1969.0, 76.68191235628086,
        77.37931271575705],
       [5.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
        47.61412, -122.33664, 1926.0, 68.53505428597694,
        71.00764283155655],
       [8.0, 2016.0, 'NonResidential', 'Hotel', 98121.0, 'DOWNTOWN',
        47.61375, -122.34047, 1980.0, 67.01346098859122,
        68.34485815906346]], columns=['OSEBuildingID', 'DataYear', 'BuildingType', 
                                      'PrimaryPropertyType', 
 'ZipCode', 'Neighborhood', 'Latitude', 'Longitude', 'YearBuilt', 
 'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)' ])
df_stack[['OSEBuildingID', 'DataYear', 'BuildingType', 'PrimaryPropertyType', 
          'ZipCode', 'Neighborhood', 'Latitude', 'Longitude', 'YearBuilt', 
          'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)']].groupby('OSEBuildingID').max()

Solution 3:[3]

I recently encountered this error with pandas version 1.3.2 and found that the issue was coming from having two columns with the same name. For a DataFrame with columns col1, val1, val1, calling df.groupby('col1').agg({'val1': np.min}) threw this error because there were two columns named val1.
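
A minimal sketch of that situation, with made-up column names (col1, val1) and values, and one way to drop the duplicated column before aggregating:

import numpy as np
import pandas as pd

# deliberately build a frame with two columns named 'val1'
df = pd.DataFrame([[1, 10, 20], [1, 5, 30]], columns=['col1', 'val1', 'val1'])

# df.groupby('col1').agg({'val1': np.min})   # raises AssertionError on affected versions

# keep only the first occurrence of each duplicated column name, then aggregate
df = df.loc[:, ~df.columns.duplicated()]
print(df.groupby('col1').agg({'val1': np.min}))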

Solution 4:[4]

I also had this problem, but in my case it was due to NaT values in datetime columns. Be sure to use fillna on the datetime column when this happens.

My pandas version is 1.3.2.
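
A minimal sketch of that fix; the column names, dates, and the 1970-01-01 sentinel below are made up for illustration, so pick a fill value that makes sense for your data:

import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b'],
    'seen': pd.to_datetime(['2021-01-01', None, '2021-03-01']),  # the None becomes NaT
})

# fill the missing timestamps before grouping so max() never sees NaT
df['seen'] = df['seen'].fillna(pd.Timestamp('1970-01-01'))
print(df.groupby('group')['seen'].max())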

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   ALollz
Solution 2   David Erickson
Solution 3   sep4
Solution 4