How to improve the prediction of missing data using sklearn regression?

I need to predict some missing data. I have a dataset of production values over the last 7 years, which are supposedly reported hourly. However, many data points are missing, which is why I need to predict them. The data should be periodic over the year, roughly following a sinusoidal curve. There should also be a correlation with the production values reported around the same time. The dataset is quite large, with over 60'000 rows, but only 52'000 rows have reported values. I currently drop all NA values first to build a prediction model, and would then like to use this model to predict the missing data.

With my current approach, however, I only get an r^2 of 0.019 and a mean squared error of 4031 for the linear regression, and an r^2 of -0.12 for the nonlinear model (SVR). With the RandomForestRegressor I currently get an r^2 of only 0.58. How can I improve this?

My current setup is as follows:

import os

import pandas as pd
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# data_dir, file_gen and plants are defined earlier (not shown)
df = pd.read_csv(os.path.join(data_dir, file_gen), sep=',')
# the .dt accessor below needs a datetime column (pass dayfirst=True for day/month/year dates)
df['Datetime'] = pd.to_datetime(df['Datetime'])

# keep only rows with reported values; .copy() avoids SettingWithCopyWarning
df_only_values = df.dropna(subset=plants, how='any').copy()
df_only_values['month'] = df_only_values['Datetime'].dt.month
df_only_values['day'] = df_only_values['Datetime'].dt.day
df_only_values['hour'] = df_only_values['Datetime'].dt.hour
df_only_values['year'] = df_only_values['Datetime'].dt.year

Predictors = ['month', 'hour', 'year']  # note: 'day' is computed but not used
TargetVariable = ['A']
X = df_only_values[Predictors].values
y = df_only_values[TargetVariable].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
y_train = y_train.ravel()
y_test = y_test.ravel()

# linear
regressor = LinearRegression().fit(X_train, y_train)
print(regressor.score(X_train, y_train))
y_pred = regressor.predict(X_test)
r2 = r2_score(y_test, y_pred)

# non linear
regr = svm.SVR()
regr.fit(X_train, y_train)
y_predict_2 = regr.predict(X_test)
r2_2 = r2_score(y_test, y_predict_2)

# RandomForestRegressor
regr = RandomForestRegressor()
regr.fit(X_train, y_train)
random_forest_predicted = regr.predict(X_test)
r2_3 = r2_score(y_test, random_forest_predicted)
print("r2_3 \n", r2_3)
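
Since the data is expected to be periodic, one direction that might be worth trying is to encode the time features cyclically instead of as raw integers, so that hour 23 sits next to hour 0 and late December next to early January. The following is only a minimal sketch under assumptions: the data is synthetic (a noisy yearly sine wave standing in for column A), and the helper name add_cyclical_features is made up for illustration. Whether it helps on the real data is untested.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def add_cyclical_features(frame, datetime_col="Datetime"):
    """Return a copy with sine/cosine encodings of hour-of-day and day-of-year."""
    out = frame.copy()
    stamps = out[datetime_col].dt
    out["hour_sin"] = np.sin(2 * np.pi * stamps.hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * stamps.hour / 24)
    out["doy_sin"] = np.sin(2 * np.pi * stamps.dayofyear / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * stamps.dayofyear / 365.25)
    return out


# Synthetic stand-in for the real data (assumption): two years of hourly
# values following a yearly sine wave plus Gaussian noise.
rng = np.random.default_rng(0)
index = pd.date_range("2014-01-01", periods=2 * 365 * 24, freq=pd.Timedelta(hours=1))
day_of_year = index.dayofyear.to_numpy()
demo = pd.DataFrame({"Datetime": index})
demo["A"] = 50 + 40 * np.sin(2 * np.pi * day_of_year / 365.25) + rng.normal(0, 5, size=len(index))

features = ["hour_sin", "hour_cos", "doy_sin", "doy_cos"]
encoded = add_cyclical_features(demo)

X_train, X_test, y_train, y_test = train_test_split(
    encoded[features], encoded["A"], test_size=0.4, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print("r2 with cyclical features:", score)
```

On the real dataset the same helper would be applied to df_only_values, with the four new columns used as Predictors, possibly together with 'year'.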

Did I do something wrong? Or what other method would you recommend?

Many thanks already in advance for your help.

Best fidu13

My data looks like:

datetime               | A   |
-----------------------|-----|
07/12/2014  01:00:00   | 102 |
07/12/2014  02:00:00   |   0 |
07/12/2014  03:00:00   |  12 |
07/12/2014  04:00:00   |   0 |
07/12/2014  05:00:00   |   0 |
07/12/2014  06:00:00   |  34 |
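
Because neighbouring hours are expected to be correlated, short gaps could perhaps be filled directly from the surrounding values rather than from a regression on calendar features. The sketch below uses pandas time-based interpolation on a small made-up hourly series (the values are hypothetical, not taken from the real file). It is an alternative technique to the regression models above, not a replacement that is known to work better here.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sample; NaN marks a missing report.
index = pd.date_range("2014-07-12 01:00", periods=8, freq=pd.Timedelta(hours=1))
series = pd.Series(
    [102.0, np.nan, 12.0, 0.0, np.nan, np.nan, 34.0, 30.0],
    index=index,
    name="A",
)

# Linear interpolation weighted by the time distance between timestamps.
# limit=3 caps how many consecutive NaNs are filled (longer gaps would be
# filled only partially); limit_area="inside" never extrapolates past the
# first or last reported value.
filled = series.interpolate(method="time", limit=3, limit_area="inside")
print(filled)
```

Long outages would still need a model-based estimate, since interpolation across many days ignores the daily and yearly pattern.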

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
