Sklearn multiple training sets
I'm experimenting with sklearn and the diabetes dataset in order to build a linear regression. So far I've done:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
Then I chose 3 columns - indexes 0, 2 and 3 - age, bmi and bp.
diabetes_Xage = diabetes_X[:, np.newaxis, 0] #age
diabetes_Xbmi = diabetes_X[:, np.newaxis, 2] #bmi
diabetes_Xbp = diabetes_X[:, np.newaxis, 3] #bp
Then I split the data 80/20, but I want to combine the 3 datasets. I've done it like this:
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
diabetes_Xage, diabetes_y, test_size=0.8, random_state=0)
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
diabetes_Xbmi, diabetes_y, test_size=0.8, random_state=0)
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
diabetes_Xbp, diabetes_y, test_size=0.8, random_state=0)
Now I'm trying to fit the linear regression and look at the coefficients:
# Create a linear regression object and train it on the training set
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
#coefficients
print("Coefficients: \n", regr.coef_)
#mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
#coefficient of determination
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
And the outcome is:
Coefficients:
[815.11490401]
Mean squared error: 4695.76
Coefficient of determination: 0.18
My problem is that I have 3 datasets, and the code I've prepared takes into account only the last one entered (diabetes_Xbp). How should I correct the code so that the result reflects all 3 datasets combined?
Solution 1:[1]
Every time you call train_test_split() you are overwriting the previous assignments to diabetes_X_train, diabetes_X_test, diabetes_y_train and diabetes_y_test, so only the last split (on diabetes_Xbp) survives.
I would first store the 3 feature columns in a single NumPy array:
diabetes = diabetes_X[:,[0,2,3]]
Then you can make a single call to the data splitter:
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
diabetes, diabetes_y, test_size=0.8, random_state=0)
Additionally, setting test_size=0.8 means you are training on 20% of the data and evaluating on the remaining 80%. For an 80/20 train/test split you want test_size=0.2.
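Putting the pieces together, a minimal end-to-end sketch might look like this (column indexes 0, 2 and 3 for age, bmi and bp are taken from the question; variable names are shortened for readability):

```python
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load the data and keep the age, bmi and bp columns together in one array
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes = diabetes_X[:, [0, 2, 3]]

# A single split; test_size=0.2 trains on 80% of the data
X_train, X_test, y_train, y_test = train_test_split(
    diabetes, diabetes_y, test_size=0.2, random_state=0)

# Fit one model on all three features at once
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

print("Coefficients:\n", regr.coef_)  # one coefficient per feature
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))
```

With three features, regr.coef_ now contains three coefficients instead of one.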
As for whether performance will improve with additional data: it is hard to say. Most likely some additional features will improve performance, but adding too many can also lead to overfitting. Try taking a look at sklearn's feature selection methods.
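For instance, sklearn's SelectKBest with the f_regression score function can rank all ten diabetes features against the target and keep only the strongest ones; the choice of k=3 below is purely illustrative:

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_regression

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Score every feature with a univariate F-test and keep the 3 best
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(diabetes_X, diabetes_y)

print("Selected column indexes:", selector.get_support(indices=True))
print("F-scores:", np.round(selector.scores_, 1))
```

The selected indexes need not match a hand-picked set like 0, 2 and 3; in particular, bmi (index 2) is a much stronger univariate predictor than age.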
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sean |
