'Groupby with apply

I'm not able to figure out the error -

AttributeError: 'numpy.int64' object has no attribute 'startswith' generated from the line - "year_indices = [x for x in sum_series if x.startswith("emissions")]"

This is my code:

import pandas as pd

URL = "https://www.epa.gov/sites/default/files/2021-03/state_tier1_caps.xlsx"

df = pd.read_excel(URL, sheet_name="State_Trends", header=1)

def minmaxyear(subdf):
    sum_series = subdf.sum()
    year_indices = [x for x in sum_series if x.startswith("emissions")]
    minyear = sum_series[year_indices].astype(float).idxmin()
    maxyear = sum_series[year_indices].astype(float).idxmax()
    return pd.Series({"min year": minyear[-2:], "max year": maxyear[-2:]})

df_years = df[df["Pollutant"]=="CO"].groupby("State").apply(minmaxyear)
print(df_years)


Solution 1:[1]

@lytseeker's answer does solve the error you were getting but it returns an empty series. Meaning the lines getting the minyear and maxyear would throw an error.

This is because it iterates through the values of sum_series and not the indices.

So do this instead.

year_indices = [x for x in sum_series.index if x.startswith("emissions")]

After making this change I was able to get this output -

       min year max year
State                  
AK          07       02
AL          21       90
AR          21       90
AZ          21       90
CA          21       90
CO          16       90
CT          21       90
DC          17       90
DE          21       90
FL          17       01
GA          17       90

Solution 2:[2]

Since x is an integer, in order to do a use a string function on it (startswith), you should first convert it to str:

year_indices = [x for x in sum_series if str(x).startswith("emissions")]

Solution 3:[3]

You can pass mask for compare indices of Series:

def minmaxyear(subdf):
    sum_series = subdf.sum()
    year = sum_series[sum_series.index.str.startswith("emissions")].astype(float)
    minyear = year.idxmin()
    maxyear = year.idxmax()
    return pd.Series({"min year": minyear[-2:], "max year": maxyear[-2:]})

df_years = df[df["Pollutant"]=="CO"].groupby("State").apply(minmaxyear)
print(df_years.head())
      min year max year
State                  
AK          07       02
AL          21       90
AR          21       90
AZ          21       90
CA          21       90

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Zero
Solution 2 lytseeker
Solution 3 jezrael