What is the difference between x_test, x_train, y_test, y_train in sklearn?
I'm learning sklearn and I don't quite understand the difference between these four outputs, or why train_test_split returns them.
In the documentation I found some examples, but they weren't enough to resolve my doubts.
Does the code use x_train to predict x_test, or use x_train to predict y_test?
What is the difference between train and test? Do I use the train set to predict the test set, or something similar?
I'm very confused about it. Below is the example provided in the documentation.
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
Solution 1:[1]
Let's say we have this data:
 Age   Sex   |   Disease
-------------|-------------
   X_train   |   y_train
  5     F    |   A Disease  )
 15     M    |   B Disease  )
 23     M    |   B Disease  )  training
 39     M    |   B Disease  )  data
 61     F    |   C Disease  )
 55     M    |   F Disease  )
 76     F    |   D Disease  )
 88     F    |   G Disease  )
-------------|-------------
   X_test    |   y_test
 63     M    |   C Disease  )
 46     F    |   C Disease  )  test
 28     M    |   B Disease  )  data
 33     F    |   B Disease  )
X_train contains the feature values (age and sex => training data).
y_train contains the target output corresponding to the X_train values (disease => training data), i.e. the values the model should learn to reproduce during training.
The model also produces predictions during training, which should be very close to (or the same as) the y_train values if the model is a successful one.
X_test contains the feature values to be evaluated after training (age and sex => test data).
y_test contains the target output (disease => test data) corresponding to X_test; it is compared against the predictions the trained model makes for the X_test values in order to determine how successful the model is.
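For concreteness, here is a minimal sketch of that workflow with scikit-learn. The age/sex/disease values are just an encoded version of the toy table above, and the choice of a decision tree classifier and the variable names are illustrative assumptions, not part of the original answer:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Features: age and sex (sex encoded as 0 = F, 1 = M)
X = np.array([[5, 0], [15, 1], [23, 1], [39, 1], [61, 0], [55, 1],
              [76, 0], [88, 0], [63, 1], [46, 0], [28, 1], [33, 0]])
# Target: the disease label for each row
y = np.array(["A", "B", "B", "B", "C", "F", "D", "G", "C", "C", "B", "B"])

# Split into training and test sets (~2/3 train, ~1/3 test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Train on (X_train, y_train), then predict on the held-out X_test
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Compare predictions against y_test to judge the model
print(accuracy_score(y_test, y_pred))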
Solution 2:[2]
You're supposed to train your classifier / regressor using your training set, and test / evaluate it using your testing set.
Your classifier / regressor uses x_train to predict y_pred and uses the difference between y_pred and y_train (through a loss function) to learn. Then you evaluate it by computing the loss between the predictions made from x_test (which could also be called y_pred) and y_test.
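As a short illustration of that fit / predict / evaluate loop, here is a sketch with made-up data; the linear regressor and mean squared error are just one possible choice of model and loss, not something prescribed by the answer:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.random.rand(100, 3)           # 100 samples, 3 features
y = X @ np.array([1.0, 2.0, 3.0])    # a simple linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

reg = LinearRegression()
reg.fit(X_train, y_train)            # learn from (X_train, y_train)

y_pred = reg.predict(X_test)         # predictions for the held-out features
print(mean_squared_error(y_test, y_pred))  # evaluate against y_test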
Solution 3:[3]
Consider X as 1000 data points and Y as the integer class labels (which class each data point belongs to).
Eg:
X = [1.24, 2.36, 3.24, ... (1000 terms)]
Y = [1, 0, 0, 1, ... (1000 terms)]
If we split in a 600:400 ratio:
X_train => will have 600 data points
Y_train => will have the class labels corresponding to those 600 data points
X_test => will have the remaining 400 data points
Y_test => will have the class labels corresponding to those 400 data points
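A quick sketch of what such a 600:400 split looks like in code (the data here is random and only serves to show the resulting shapes):
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 1)          # 1000 data points
y = np.random.randint(0, 2, 1000)    # integer class label for each point

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)  # (600, 1) (600,)
print(X_test.shape, y_test.shape)    # (400, 1) (400,)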
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Julio Nobre |
| Solution 2 | |
| Solution 3 | Ramkumar Thayumanavan |
