How can I split the data into 40% training, 30% validation and 30% test?

This is what I have coded so far. I am able to create the permutation and the three variables that I need: training, validation and test. I just can't figure out how to split the data into the percentages mentioned above. In short, those variables should be created with the right proportions, and then the result should be printed. Thanks in advance.

import numpy as np

def split_data(data_dict, data_split):
    """divide the data into training, validate and test sets. 
    :param data_dict: a dictionary of the data with keys 'X' and 'Y'
    :param data_split: a list of the fraction of the data to be in each set of form 
    [training_fraction, validation_fraction, test_fraction]. The fractions should all add up to 1.
    :returns training_dict, validation_dict, test_dict: dictionaries of the same form as the data_dict, 
    containing the different sets"""
    
    assert abs(np.sum(data_split) - 1) < 0.01
    
    # work out how many datapoints will be in the train and validation sets 
    n_train = int(len((data_dict['X']))*data_split[0])
    n_validate = int(len((data_dict['X']))*data_split[1])
    
    # generate a random permutation of indices of the data and split into training, validation and test
    perm = np.random.permutation(range(len(data_dict['X'])))
    indices_train, indices_validate, indices_test = np.split(perm, [n_train, n_train+ n_validate])
    
    # create training, validation and test dictionaries 
    training_dict = {'X': data['X'][indices_train], 'Y': data['Y'][indices_train]}
    validation_dict = {'X': data['X'][indices_validate], 'Y': data['Y'][indices_validate]}
    test_dict = {'X': data['X'][indices_test], 'Y': data['Y'][indices_test]}
    
    return training_dict, validation_dict, test_dict


Solution 1:[1]

Your indices_train etc. are arrays of indices from which you want to create subarrays. Use np.take:

# create training, validation and test dictionaries 
training_dict = {'X': np.take(data_dict['X'], indices_train), 'Y': np.take(data_dict['Y'], indices_train)}
validation_dict = {'X': np.take(data_dict['X'], indices_validate), 'Y': np.take(data_dict['Y'], indices_validate)}
test_dict = {'X': np.take(data_dict['X'], indices_test), 'Y': np.take(data_dict['Y'], indices_test)}

Additionally, data was replaced with data_dict, since the original code references a variable named data that is not defined inside the function.
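Putting both fixes together, here is a minimal sketch of the corrected function; the toy data_dict at the bottom is illustrative only:

```python
import numpy as np

def split_data(data_dict, data_split):
    """Split data_dict into training, validation and test dictionaries."""
    # fractions must sum to (approximately) 1
    assert abs(np.sum(data_split) - 1) < 0.01

    n = len(data_dict['X'])
    n_train = int(n * data_split[0])
    n_validate = int(n * data_split[1])

    # random permutation of indices, cut at the train/validate boundaries
    perm = np.random.permutation(n)
    idx_train, idx_validate, idx_test = np.split(perm, [n_train, n_train + n_validate])

    def subset(idx):
        # np.take pulls out the rows at the given indices
        return {'X': np.take(data_dict['X'], idx), 'Y': np.take(data_dict['Y'], idx)}

    return subset(idx_train), subset(idx_validate), subset(idx_test)

data_dict = {'X': np.arange(10), 'Y': np.arange(10) * 2}
train, validate, test = split_data(data_dict, [0.4, 0.3, 0.3])
print(len(train['X']), len(validate['X']), len(test['X']))  # 4 3 3
```

With ten datapoints and a [0.4, 0.3, 0.3] split, the three sets get 4, 3 and 3 points respectively, and every index appears in exactly one set.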

Solution 2:[2]

When I have to split a dataset like this, I usually use the sklearn.model_selection.train_test_split function. It takes a dataset and splits it into train and test sets.

In the snippet below, the training data is called xtrain and its labels are ytrain; the held-out test data is xtest, and ytest holds the corresponding test answers you can use to score your model. All of these sets have the same shape as your input.

There are different behaviors you can get out of train_test_split, but the standard one is the default mode: it draws a random sample of indices, keeping each Y value aligned with its respective X value, so you don't need to manage the indices yourself. You just specify how big you want the test set to be, i.e. what fraction of your data you want to test with versus train with.

Here are the official docs on the function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split
...
def split_data(data_dict, data_split_percentage):
    assert data_split_percentage <= 100 # make sure we're dealing with a percentage
    test_size = (data_split_percentage/100.0) # get [0.0,1.0] float
    xtrain, xtest, ytrain, ytest = train_test_split(data_dict['X'],
                                                    data_dict['Y'],
                                                    test_size=test_size)
    return xtrain, xtest, ytrain, ytest

# Use your function to fit a model then assert results are == answers
xtrain, xtest, ytrain, ytest = split_data(data_dict, 70)
model = ModelClass.fit(xtrain, ytrain)
estimates = model.estimate(xtest)

for i, estimate in enumerate(estimates):
    assert estimate == ytest[i]
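Since train_test_split only produces a two-way split, the 40/30/30 split the question asks for can be built by chaining two calls. A sketch, using toy arrays for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
Y = np.arange(100)

# First cut: keep 40% for training, leaving 60% to divide further.
x_train, x_rest, y_train, y_rest = train_test_split(X, Y, train_size=0.4)

# Second cut: split the remaining 60% in half -> 30% validation, 30% test overall.
x_val, x_test, y_val, y_test = train_test_split(x_rest, y_rest, test_size=0.5)

print(len(x_train), len(x_val), len(x_test))  # 40 30 30
```

The second test_size is relative to what is left after the first split, which is why 0.5 of the remaining 60% yields the desired 30% each.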

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Jeremy
Solution 2: (unattributed)