Sklearn Pipeline classifier throwing ValueError even when the missing values are taken care of
I have created an sklearn pipeline that preprocesses the data and then runs the model on the processed data. The preprocessing step takes care of missing values, but the pipeline still throws the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
My code is below:
def test_sklearn_pipeline(random_state_num):
    numeric_features = ["x", "y"]
    categorical_features = ["wconfid", "pctid"]
    missing_features = ["x"]
    missing_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="mean"))]
    )
    scale_transformer = Pipeline(
        steps=[("scaler", StandardScaler())]
    )
    categorical_transformer = Pipeline(
        steps=[("ohe", OneHotEncoder(handle_unknown="ignore"))]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ("miss", missing_transformer, missing_features),
            ("cat", categorical_transformer, categorical_features),
            ("outlier_remover", outlier_removal, numeric_features),
            ("num", scale_transformer, numeric_features),
        ],
        remainder="passthrough",
    )
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LinearRegression())]
    )
    df = pd.read_csv("accelerometer_modified.csv")
    df = df.drop(columns=["random"])
    X, y = df.drop(columns=["z"]), df.loc[:, "z"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=random_state_num
    )
    clf.fit(X_train, y_train)
    print("MSE: %.3f" % mean_squared_error(clf.predict(X_test), y_test))
Solution 1:[1]
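The numeric features and the missing features have the column x in common. ColumnTransformer applies each transformer independently to the original input DataFrame and concatenates the results; it does not pass one transformer's output on to the next. This means the StandardScaler in "num" (and likewise outlier_remover) runs on the raw x column, not the imputed one, so the NaN values still reach the model. You need the two transformations to run sequentially: put them in one small Pipeline, as you have already done for the other transformers, whose steps impute first and scale second, and apply that pipeline to the numeric columns.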
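A minimal sketch of that arrangement, assuming a small made-up DataFrame in place of accelerometer_modified.csv (the values are invented for illustration). The outlier_removal step is left out because it is not defined in the question, and the name numeric_transformer is introduced here only for the example:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real CSV; all values are made up.
df = pd.DataFrame({
    "wconfid": [1, 1, 2, 2, 3, 3],
    "pctid": [20, 40, 20, 40, 20, 40],
    "x": [1.0, np.nan, 0.5, 1.5, np.nan, 2.0],
    "y": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "z": [0.9, 1.1, 1.0, 1.2, 0.8, 1.3],
})
X, y = df.drop(columns=["z"]), df["z"]

numeric_features = ["x", "y"]
categorical_features = ["wconfid", "pctid"]

# Impute and scale in ONE pipeline, so the scaler receives imputed values.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# Each column now goes through exactly one transformer.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LinearRegression()),
])
clf.fit(X, y)
predictions = clf.predict(X)
```

Because x is listed in only one transformer, no un-imputed copy of it is concatenated into the matrix that reaches LinearRegression.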
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | marc_s |
