Scikit-learn Imputer with multiple values
Is there a way for a Scikit-learn Imputer to look for and replace multiple values which are considered "missing values"?
For example, I would like to do something like
imp = Imputer(missing_values=(7,8,9))
But according to the docs, the missing_values parameter only accepts a single integer:
missing_values : integer or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.
Solution 1:[1]
Why not do this manually on your original dataset? Assuming you are using a pd.DataFrame, you can replace the placeholder values with NaN first and then impute:
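import numpy as np
import pandas as pd
# Imputer was removed in scikit-learn 0.22; on newer versions use
# `from sklearn.impute import SimpleImputer` and SimpleImputer() instead.
from sklearn.preprocessing import Imputer
df = pd.DataFrame({'A': [1, 2, 3, 8], 'B': [1, 2, 5, 3]})
# Treat both 1 and 2 as missing by converting them to NaN first
df_new = df.replace([1, 2], np.nan)
# The default strategy fills each NaN with its column mean
df_imp = Imputer().fit_transform(df_new)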
This results in df_imp:
array([[ 5.5,  4. ],
       [ 5.5,  4. ],
       [ 3. ,  5. ],
       [ 8. ,  3. ]])
If you want to make this part of a pipeline, you would just need to implement a custom transformer with similar logic.
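A minimal sketch of such a transformer, assuming a recent scikit-learn where SimpleImputer (from sklearn.impute) takes the place of the older Imputer. The class name SentinelToNaN and its sentinels parameter are hypothetical, made up for this illustration and not part of scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class SentinelToNaN(TransformerMixin, BaseEstimator):
    """Hypothetical helper: convert several placeholder values to np.nan."""

    def __init__(self, sentinels=(7, 8, 9)):
        self.sentinels = sentinels

    def fit(self, X, y=None):
        # Nothing to learn; the placeholder list is fixed at construction.
        return self

    def transform(self, X):
        # Work on a float copy so that NaN can be stored in it.
        X = np.array(X, dtype=float)
        X[np.isin(X, list(self.sentinels))] = np.nan
        return X


df = pd.DataFrame({'A': [1, 2, 3, 8], 'B': [1, 2, 5, 3]})
pipeline = make_pipeline(
    SentinelToNaN(sentinels=(1, 2)),   # same placeholders as in the example above
    SimpleImputer(strategy='mean'),    # then fill NaN with the column mean
)
df_imp = pipeline.fit_transform(df)
```

With the same data as above, this should give the same imputed array, and the whole thing can be dropped into a larger pipeline as a single preprocessing step.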
Solution 2:[2]
You could chain multiple imputers in a pipeline, one per placeholder value, but that gets unwieldy quickly as the number of values grows, and I'm not sure how efficient it is.
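from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# One imputer per placeholder value; each replaces its value with 10
pipeline = make_pipeline(
    SimpleImputer(missing_values=7, strategy='constant', fill_value=10),
    SimpleImputer(missing_values=8, strategy='constant', fill_value=10),
    SimpleImputer(missing_values=9, strategy='constant', fill_value=10)
)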
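To see what the chain does, here is a small sketch run on made-up data (the array X is an assumption for illustration only). Each step fills one placeholder with the constant 10. Note that with a statistic-based strategy such as 'mean', each step would compute its statistic while the other placeholders are still in the data, so chaining is best suited to constant fills.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Made-up data in which 7, 8 and 9 all stand for "missing"
X = np.array([[1, 7],
              [8, 2],
              [3, 9]])

pipeline = make_pipeline(
    SimpleImputer(missing_values=7, strategy='constant', fill_value=10),
    SimpleImputer(missing_values=8, strategy='constant', fill_value=10),
    SimpleImputer(missing_values=9, strategy='constant', fill_value=10),
)

# Each step handles one placeholder; the other values pass through unchanged.
X_filled = pipeline.fit_transform(X)
```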
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jan K |
| Solution 2 |
