How to cross-validate different filling methods for missing values?
I have a dataset with missing values that I would like to fill. I want to do this with several different methods and then compare them to see which one performs best. I am new to this kind of problem, and my idea was to make the comparison with test and training data using sklearn. I would like to obtain statistically meaningful metrics on which I can base an educated decision about which method to choose for my data.
My original data has over 60'000 rows and looks as follows:
datetime | A |
-----------------------|-----|
07/12/2014 01:00:00 | 102 |
07/12/2014 02:00:00 | Na |
07/12/2014 03:00:00 | 12 |
07/12/2014 04:00:00 | 98 |
07/12/2014 05:00:00 | Na |
07/12/2014 06:00:00 | 34 |
My code so far looks something like this:
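from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must come before the IterativeImputer import
from sklearn.impute import IterativeImputer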
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = load_data(data_dir, file_gen_test)
df_only_values = df[~df['A'].isna()].copy()  # copy so the new columns below are not written to a view of df
df_only_values['month'] = df_only_values['Datetime'].dt.month
df_only_values['hour'] = df_only_values['Datetime'].dt.hour
df_only_values['year'] = df_only_values['Datetime'].dt.year
TargetVariable = ['A']
Predictors = ['month','hour','year']
X = df_only_values[Predictors].values
y = df_only_values[TargetVariable].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
df["forward"] = df['A'].ffill(axis=0)
df["backward"] = df['A'].bfill(axis=0)
df["linear"] = df['A'].interpolate()
df["barycentric"] = df['A'].interpolate(method='barycentric')
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit(X_test, y_test)
My question is twofold. How should I pass the values to these different methods, given that some act only on "Na" values and some cannot accept "Na" at all? And how can I best compare the different approaches? I am aware that I have probably made some stupid mistakes and would be really glad if you could point them out and suggest another approach, as I am still a newbie.
Many thanks already in advance for all your help.
Best fidu13
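One common way to compare filling methods is to hide a random part of the values that are actually known, fill the series with each method, and score the filled values against the hidden truth. The following is a minimal sketch of that idea, not a definitive implementation. It assumes a synthetic hourly series standing in for column A, and the names score_methods and iterative_fill are hypothetical helpers introduced only for illustration. The pandas methods receive the series with "Na" values left in place, while IterativeImputer receives a feature matrix in which only column A contains "Na" values.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic hourly data standing in for column 'A' (assumption, not the real data).
n = 2000
rng = np.random.default_rng(0)
index = pd.date_range("2014-12-07 01:00", periods=n, freq=pd.Timedelta(hours=1))
values = 60 + 30 * np.sin(2 * np.pi * np.arange(n) / 24) + rng.normal(0, 5, size=n)
series = pd.Series(values, index=index, name="A")
# Gaps that are "really" missing; their true values are never used for scoring.
series.iloc[rng.choice(n, size=200, replace=False)] = np.nan


def iterative_fill(s):
    """Fill s with IterativeImputer, using hour and month as predictors."""
    features = pd.DataFrame(
        {"hour": s.index.hour, "month": s.index.month, "A": s.to_numpy()},
        index=s.index,
    )
    filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(features)
    return pd.Series(filled[:, -1], index=s.index)


methods = {
    "forward": lambda s: s.ffill(),
    "backward": lambda s: s.bfill(),
    "linear": lambda s: s.interpolate(method="linear"),
    "iterative": iterative_fill,
}


def score_methods(s, methods, n_repeats=5, hide_frac=0.1, seed=0):
    """Repeatedly hide known values, fill them, and record the errors."""
    rng = np.random.default_rng(seed)
    known = np.flatnonzero(s.notna().to_numpy())
    rows = []
    for repeat in range(n_repeats):
        hidden = rng.choice(known, size=int(hide_frac * len(known)), replace=False)
        masked = s.copy()
        masked.iloc[hidden] = np.nan
        truth = s.iloc[hidden]
        for name, fill in methods.items():
            # Drop positions a method could not fill (e.g. a leading gap for ffill).
            error = (fill(masked).iloc[hidden] - truth).dropna()
            rows.append({
                "method": name,
                "repeat": repeat,
                "mae": error.abs().mean(),
                "rmse": np.sqrt((error ** 2).mean()),
            })
    return pd.DataFrame(rows)


scores = score_methods(series, methods)
# Mean and spread of each metric over the repeats, one row per method.
summary = scores.groupby("method")[["mae", "rmse"]].agg(["mean", "std"])
print(summary)
```

The mean and standard deviation over the repeats give a rough sense of whether the difference between two methods is larger than the run-to-run variation. Note that hiding isolated random points favours methods that rely on direct neighbours. If the real gaps are long, hiding contiguous blocks of similar length would probably give a fairer comparison.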
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
