What is the difference between x_test, x_train, y_test, y_train in sklearn?

I'm learning sklearn and I don't fully understand the difference between them, or why the function train_test_split returns 4 outputs.

I found some examples in the documentation, but they weren't enough to clear up my doubts.

Does the code use the x_train to predict the x_test or use the x_train to predict the y_test?

What is the difference between train and test? Do I use train to predict the test or something similar?

I'm very confused about it. Below is the example provided in the documentation.

>>> import numpy as np  
>>> from sklearn.model_selection import train_test_split  
>>> X, y = np.arange(10).reshape((5, 2)), range(5)  
>>> X
array([[0, 1], 
       [2, 3],  
       [4, 5],  
       [6, 7],  
       [8, 9]])  
>>> list(y)  
[0, 1, 2, 3, 4] 
>>> X_train, X_test, y_train, y_test = train_test_split(  
...     X, y, test_size=0.33, random_state=42)  
...  
>>> X_train  
array([[4, 5], 
       [0, 1],  
       [6, 7]])  
>>> y_train  
[2, 0, 3]  
>>> X_test  
array([[2, 3], 
       [8, 9]])  
>>> y_test  
[1, 4]  
>>> train_test_split(y, shuffle=False)  
[[0, 1, 2], [3, 4]]


Solution 1:[1]

Let's say we have this data:

 Age    Sex  |  Disease
-------------|------------
   X_train   |   y_train   )
             |             )
  5      F   |  A Disease  )
 15      M   |  B Disease  )
 23      M   |  B Disease  ) training
 39      M   |  B Disease  ) data
 61      F   |  C Disease  )
 55      M   |  F Disease  )
 76      F   |  D Disease  )
 88      F   |  G Disease  )
-------------|------------
   X_test    |   y_test    )
             |             )
 63      M   |  C Disease  )
 46      F   |  C Disease  ) test
 28      M   |  B Disease  ) data
 33      F   |  B Disease  )

X_train contains the values of the features (age and sex => training data).

y_train contains the target output corresponding to the X_train values (disease => training data), i.e. the values the model should produce after the training process.

After training, the model also generates its own values (predictions) for X_train; if the model is a successful one, these should be very close to, or the same as, the y_train values.

X_test contains the values of the features to be tested after training (age and sex => test data).

y_test contains the target output (disease => test data) corresponding to X_test (age and sex => test data). After training, it is compared with the model's predictions for the given X_test values in order to determine how successful the model is.
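As a rough illustration of the four roles, here is a minimal sketch that types in the rows of the table by hand instead of calling train_test_split. The numeric encoding of sex (F=0, M=1), the shortened disease labels, and the choice of DecisionTreeClassifier are assumptions made for this example only; any classifier would be used the same way.

```python
from sklearn.tree import DecisionTreeClassifier

# Features [age, sex] from the table, with sex encoded as F=0, M=1
# (this encoding is an assumption made for the sketch).
X_train = [[5, 0], [15, 1], [23, 1], [39, 1], [61, 0], [55, 1], [76, 0], [88, 0]]
y_train = ["A", "B", "B", "B", "C", "F", "D", "G"]

X_test = [[63, 1], [46, 0], [28, 1], [33, 0]]
y_test = ["C", "C", "B", "B"]

# The model only sees the training rows while it learns.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Predictions are made from X_test alone; y_test is never shown to the model.
y_pred = model.predict(X_test)

# y_test is used only here, to count how many predictions were right.
n_correct = sum(p == t for p, t in zip(y_pred, y_test))
print(list(y_pred), "->", n_correct, "of", len(y_test), "correct")
```

With only 8 training rows the result says little about real accuracy; the point is just which of the four variables is used at which step.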

Solution 2:[2]

You're supposed to train your classifier / regressor using your training set, and test / evaluate it using your testing set.

During training, your classifier / regressor uses x_train to produce predictions y_pred and learns from the difference between y_pred and y_train (through a loss function). Then you evaluate it by computing the loss between its predictions for x_test (which could also be named y_pred) and y_test.
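The train-then-evaluate flow described here could be sketched as follows. The synthetic data from make_regression, the LinearRegression model, and mean squared error as the loss are all assumptions chosen to keep the example short, not something the question prescribes.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Training: fit() sees x_train and y_train and minimises its loss on them.
model = LinearRegression()
model.fit(x_train, y_train)

# Evaluation: predict from x_test, then compare those predictions with y_test.
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)

train_mse = mean_squared_error(y_train, y_pred_train)
test_mse = mean_squared_error(y_test, y_pred_test)
print("train MSE:", train_mse)
print("test MSE: ", test_mse)
```

The test loss is the number that tells you how the model behaves on data it has not seen; the training loss alone can look good even when the model generalises poorly.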

Solution 3:[3]

Consider X as 1000 data points and Y as the integer class labels (the class each data point belongs to).

E.g.:
X = [1.24, 2.36, 3.24, ... (1000 terms)]
Y = [1, 0, 0, 1, ... (1000 terms)]

We split in a 600:400 ratio:

X_train => will have 600 data points

Y_train => will have the class labels corresponding to those 600 data points

X_test => will have 400 data points

Y_test => will have the class labels corresponding to those 400 data points
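A quick way to check these sizes is to print the shapes after the split. In this sketch the random data, the single feature column, and the binary labels are made up for illustration; only the 1000 points and the 600:400 ratio come from the example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 made-up data points with one feature each, plus 1000 binary class labels.
X = np.random.default_rng(0).random((1000, 1))
Y = np.random.default_rng(1).integers(0, 2, size=1000)

# test_size=0.4 puts 400 rows in the test set and the remaining 600 in training.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.4, random_state=0)

print(X_train.shape, Y_train.shape)  # (600, 1) (600,)
print(X_test.shape, Y_test.shape)    # (400, 1) (400,)
```

Rows of X and Y are shuffled together, so each label in Y_train still belongs to the data point at the same position in X_train, and likewise for the test pair.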

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Julio Nobre
Solution 2
Solution 3    Ramkumar Thayumanavan