Linear Regression ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

I'm working on a linear regression model and I'm getting the error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Here's my code:

    ### List Column Data Types for df

    # Convert 'Paid' column to float64 by first changing NaN to 0
    Training_Data['Paid'].fillna(0).astype(float)

    # Convert 'Sale Price' column to float64 by first changing NaN to 0
    #print(df.loc[pd.to_numeric(df['Sale Price'], errors='coerce').isnull()])
    #pd.to_numeric(df['Sale Price']).astype(int)
    Training_Data["Sale Price"] = Training_Data["Sale Price"].astype(str).str.strip().replace("",0).astype(float)

    # List Data Types
    Training_Data.dtypes

Which returns:

    Paid          float64
    Sale Price    float64
    dtype: object

    ### List Column Data Types for df2

    # Convert 'Paid' column to float64 by first changing NaN to 0
    Test_Data['Paid'].fillna(0).astype(float)

    # Convert 'Sale Price' column to float64 by first changing NaN to 0
    #print(df.loc[pd.to_numeric(df['Sale Price'], errors='coerce').isnull()])
    #pd.to_numeric(df['Sale Price']).astype(int)
    Test_Data["Sale Price"] = Test_Data["Sale Price"].astype(str).str.strip().replace("",0).astype(float)

    # List Data Types
    Test_Data.dtypes

Which returns:

    Paid          float64
    Sale Price    float64
    dtype: object

    ### Declare and Drop Dependent (Measured) Variable

    SourceData_train_independent = Training_Data.drop(['Sale Price'], axis = 1) # Drop dependent variable from training dataset

    SourceData_train_dependent = Training_Data['Sale Price'].copy() # New dataframe with only Dependent variable value for training dataset

    SourceData_test_independent = Test_Data.drop(['Sale Price'], axis = 1)

    SourceData_test_dependent = Test_Data['Sale Price'].copy()

    SourceData_train_independent.dtypes

Which returns:

    Paid    float64
    dtype: object

    ### Scaling Independent Train and Test Variable

    sc_X = StandardScaler()

    X_train = sc_X.fit_transform(SourceData_train_independent.values) # scale the independent variables

    y_train = SourceData_train_dependent # scaling is not required for dependent variable

    X_test = sc_X.transform(SourceData_test_independent)

    y_test = SourceData_test_dependent

Finally, when I run:

    ### Feeding Train Data

    reg = LinearRegression().fit(X_train, y_train)
    print("The Linear regression score on training data is ", 
    round(reg.score(X_train, y_train),2))

I get the error. So I'm thinking my file still has NaN values, which I thought I had corrected. Can anyone help? Thanks!



Solution 1:[1]

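Try this helper to find which columns still contain NaN or infinite values:

    import numpy as np

    def check_nan_inf(df):
        """Print the name of each column that contains NaN or infinite values."""
        for col in df.columns:
            if df[col].isnull().any():
                print(col, 'has nan')
            # np.isinf only accepts numeric data, so the columns must be numeric
            if np.isinf(df[col]).any():
                print(col, 'has inf')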
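
As for why NaN may survive the cleanup in the question: `fillna(0).astype(float)` returns a new Series, so unless the result is assigned back to `Training_Data['Paid']` the column is left unchanged. In addition, in pandas versions where `astype(str)` renders a missing value as the string "nan", the later `.astype(float)` parses that string straight back to NaN. Below is a minimal sketch of one possible cleanup, assuming the columns should be numeric and that missing or infinite entries may be treated as 0. The helper name `clean_numeric` and the sample frame are hypothetical and not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for Training_Data
raw = pd.DataFrame({
    "Paid": [100.0, np.nan, 250.0],
    "Sale Price": ["120", " ", np.nan],
})

def clean_numeric(df, columns):
    """Return a copy with the given columns coerced to float and gaps set to 0."""
    out = df.copy()
    for col in columns:
        # errors="coerce" turns blanks and unparseable text into NaN
        out[col] = pd.to_numeric(out[col], errors="coerce")
        # Treat +/-inf as missing too, then fill and assign the result back
        out[col] = out[col].replace([np.inf, -np.inf], np.nan).fillna(0.0)
    return out

cleaned = clean_numeric(raw, ["Paid", "Sale Price"])

# Both checks should pass before the data is handed to the scaler or model
print(cleaned.isna().sum().sum())
print(np.isfinite(cleaned.to_numpy()).all())
```

Whether 0 is a sensible fill value depends on the data; dropping the affected rows with `dropna()` is another option. The same cleanup would need to be applied to `Test_Data` before scaling.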

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Kyriakos