Pandas groupby with max returns AssertionError
There seems to be something wrong with pandas, and I would like your opinion. I have this DataFrame from which I need to get the max values; the code is just below:
import pandas as pd

df_stack = pd.DataFrame([[1.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
47.6122, -122.33799, 1927.0, 57.85220900338872,
59.91269863912585],
[1.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
47.61317, -122.33393, 1996.0, 55.82342114189166,
56.86951201265458],
[3.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
47.61393, -122.3381, 1969.0, 76.68191235628086,
77.37931271575705],
[5.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
47.61412, -122.33664, 1926.0, 68.53505428597694,
71.00764283155655],
[8.0, 2016.0, 'NonResidential', 'Hotel', 98121.0, 'DOWNTOWN',
47.61375, -122.34047, 1980.0, 67.01346098859122,
68.34485815906346]], columns=['OSEBuildingID', 'DataYear', 'BuildingType', 'PrimaryPropertyType',
'ZipCode', 'Neighborhood', 'Latitude', 'Longitude', 'YearBuilt',
'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)' ])
When I run the code below:
df_stack[['OSEBuildingID',
'DataYear',
'BuildingType',
'PrimaryPropertyType',
'ZipCode', 'Neighborhood', 'Latitude', 'Longitude',
'YearBuilt', 'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)']].groupby('OSEBuildingID').max()
I get an error, "AssertionError:", the same one you'll probably get if you try this. But when I comment out these two columns and run the code again:
df_stack[['OSEBuildingID',
'DataYear',
#'BuildingType',
#'PrimaryPropertyType',
'ZipCode', 'Neighborhood', 'Latitude', 'Longitude',
'YearBuilt', 'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)']].groupby('OSEBuildingID').max()
I get these results:
DataYear ZipCode Neighborhood Latitude Longitude YearBuilt SourceEUI(KWm2) SourceEUIWN(KWm2)
OSEBuildingID
1.0 2016.0 98101.0 DOWNTOWN 47.61317 -122.33393 1996.0 57.852209 59.912699
3.0 2016.0 98101.0 DOWNTOWN 47.61393 -122.33810 1969.0 76.681912 77.379313
5.0 2016.0 98101.0 DOWNTOWN 47.61412 -122.33664 1926.0 68.535054 71.007643
8.0 2016.0 98121.0 DOWNTOWN 47.61375 -122.34047 1980.0 67.013461 68.344858
If I replace max() with mean(), I can uncomment those two lines and run the code with no problem. This behaviour only happens with max() and min() (I only tested max, mean, and min), but I need the max.
Thank you if you can help.
Solution 1:[1]
This was a regression in pandas 1.0.0 that was fixed in 1.0.1, so I suggest you upgrade your version. From the 1.0.1 release notes:
Fixed regression in .groupby().agg() raising an AssertionError for some reductions like min on object-dtype columns (GH31522)
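A quick way to check whether your installed version predates the fix is to inspect pd.__version__ and retry the reduction on a small object-dtype frame. A minimal sketch with made-up data (the column names here are hypothetical):

```python
import pandas as pd

# The regression affected pandas 1.0.0; the fix landed in 1.0.1 (GH31522).
print(pd.__version__)

# On a version with the fix, grouping with max() works on object-dtype columns:
df = pd.DataFrame({"key": [1.0, 1.0, 3.0],
                   "kind": ["Hotel", "Hotel", "Office"]})  # "kind" is object dtype
out = df.groupby("key").max()
print(out.loc[1.0, "kind"])  # "Hotel" on versions with the fix
```

On an affected 1.0.0 install, the same groupby().max() call raises the AssertionError from the question.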
Solution 2:[2]
Carlos Carvalho, when I run this code, I don't get any errors. Can you confirm you still get the error if you copy and paste this into your terminal? As implied in a comment above, it might have to do with your pandas version. Also, BuildingType and PrimaryPropertyType are objects and not floats, but it should still work:
import pandas as pd

df_stack = pd.DataFrame([[1.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
47.6122, -122.33799, 1927.0, 57.85220900338872,
59.91269863912585],
[1.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
47.61317, -122.33393, 1996.0, 55.82342114189166,
56.86951201265458],
[3.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
47.61393, -122.3381, 1969.0, 76.68191235628086,
77.37931271575705],
[5.0, 2016.0, 'NonResidential', 'Hotel', 98101.0, 'DOWNTOWN',
47.61412, -122.33664, 1926.0, 68.53505428597694,
71.00764283155655],
[8.0, 2016.0, 'NonResidential', 'Hotel', 98121.0, 'DOWNTOWN',
47.61375, -122.34047, 1980.0, 67.01346098859122,
68.34485815906346]], columns=['OSEBuildingID', 'DataYear', 'BuildingType',
'PrimaryPropertyType',
'ZipCode', 'Neighborhood', 'Latitude', 'Longitude', 'YearBuilt',
'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)' ])
df_stack[['OSEBuildingID', 'DataYear', 'BuildingType', 'PrimaryPropertyType',
'ZipCode', 'Neighborhood', 'Latitude', 'Longitude', 'YearBuilt',
'SourceEUI(KWm2)', 'SourceEUIWN(KWm2)']].groupby('OSEBuildingID').max()
Solution 3:[3]
I recently encountered this error with pandas version 1.3.2 and found that the issue came from having two columns with the same name. For a dataframe with columns col1, val1, val1, calling df.groupby('col1').agg({'val1': np.min}) threw this error because there were two columns named val1.
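One way to guard against this is to check for duplicate column names before aggregating. A minimal sketch with a hypothetical col1/val1/val1 frame (the string "min" is used in agg rather than np.min, which newer pandas versions deprecate there):

```python
import pandas as pd

# Frame with two columns both named "val1" (made-up data)
df = pd.DataFrame([[1, 10, 100], [1, 20, 200], [2, 30, 300]],
                  columns=["col1", "val1", "val1"])

# Detect the duplicated names before aggregating
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)  # ['val1']

# Drop the later duplicates (keeping the first occurrence), then aggregate safely
df = df.loc[:, ~df.columns.duplicated()]
result = df.groupby("col1").agg({"val1": "min"})
print(result)
```

Whether to keep the first duplicate, the last, or rename them depends on your data; the point is that groupby aggregation expects unique column labels.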
Solution 4:[4]
I also had this problem, but it was due to NaT values in datetime columns. Be sure to use fillna on the datetime column when this happens.
My Pandas Version is 1.3.2
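As a sketch of that suggestion (the column names and the sentinel date are made up), filling NaT before the groupby avoids the problem:

```python
import pandas as pd

# Hypothetical frame: a datetime column containing a NaT value
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "seen":  pd.to_datetime(["2021-01-01", None, "2021-03-01"]),
})

# Replace NaT with a sentinel date before aggregating, as the answer suggests
df["seen"] = df["seen"].fillna(pd.Timestamp("1970-01-01"))
out = df.groupby("group")["seen"].max()
print(out)
```

Any sentinel works for max() as long as it sorts below your real dates; for min() you would want one that sorts above them instead.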
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ALollz |
| Solution 2 | David Erickson |
| Solution 3 | sep4 |
| Solution 4 | |