Where should categorical encoding be done in a k-fold CV procedure?

I want to apply a cross-validation method to my machine learning models. In these models, I also want feature selection and a grid search to be applied. Imagine that I want to estimate the performance of a K-Nearest-Neighbors classifier, applying a feature selection technique based on an F-score (ANOVA) that chooses the 10 most relevant features. The code would be as follows:

# Imports needed to run this snippet
import sys
import numpy as np
from sklearn.model_selection import (RepeatedKFold, RepeatedStratifiedKFold,
                                     GridSearchCV)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, make_scorer

# 10-times 10-fold cross-validation
n_repeats = 10
rkf = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)

# Data standardization
scaler = StandardScaler()

# Variable to contain error measures and counter for the splits
error_knn = []
split = 0


for train_index, test_index in rkf.split(X, y):
    
    # Print a dot for each train / test partition
    sys.stdout.write('.')
    sys.stdout.flush()
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Standardize the data
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    ###- In order to select the best number of neighbors -###  
    # Pipeline for training the classifier from previous notebooks
    pipeline = Pipeline([ ('knn', KNeighborsClassifier()) ])
    N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
    param_grid = { 'knn__n_neighbors': N_neighbors }

    # Evaluate the performance in a 5-fold cross-validation
    skfold = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=split)

    # n_jobs = -1 to use all processors
    gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1, param_grid=param_grid,
                          scoring=make_scorer(accuracy_score))

    gridcv.fit(X_train, y_train)
    
    ###- Results -###
    # Mean accuracy and standard deviation
    accuracies = gridcv.cv_results_['mean_test_score']
    std_accuracies = gridcv.cv_results_['std_test_score']

    # Best value for the number of neighbors
    # Define KNeighbors Classifier with that best value
    # Method fit(X,y) to fit each model according to training data    
    best_Nneighbors = N_neighbors[np.argmax(accuracies)]
    knn = KNeighborsClassifier(n_neighbors = best_Nneighbors)
    knn.fit(X_train, y_train)
    
    # Error for the prediction
    error_knn.append(1.0 - np.mean(knn.predict(X_test) == y_test))
    
    split += 1

However, my columns are categorical (except the binary label), and I need to apply a categorical encoding. I cannot remove these columns because they are essential.

Where would you perform this encoding, and how would you solve the problem of categories that appear in some folds but are unseen in others?



Solution 1:[1]

Categorical encoding should be performed as the first step, precisely to avoid the problem you mentioned regarding unseen labels in each fold.
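
For illustration, a minimal sketch of the encode-first approach, assuming X still holds the raw categorical columns at this point (the choice of encoder is illustrative):

from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the FULL dataset before any splitting, so every
# category present in the data gets its own column and no fold can
# encounter an unknown category at transform time
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
X = encoder.fit_transform(X)

# X is now purely numeric and can be passed to rkf.split(X, y) as before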

Additionally, your current implementation suffers from data leakage: you're fitting the scaler on the full X_train dataset before performing your inner cross-validation. This can be solved by including the StandardScaler in the pipeline used for GridSearchCV:

...

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

###- In order to select the best number of neighbors -###  
# Pipeline for training the classifier from previous notebooks
pipeline = Pipeline(
    [ ('scaler', scaler), ('knn', KNeighborsClassifier()) ]
)
N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
param_grid = { 'knn__n_neighbors': N_neighbors }

...

Another couple of tips:

  1. GridSearchCV has a best_estimator_ attribute that can be used to extract the estimator with the best set of hyperparameters found.
  2. When using GridSearchCV with refit=True (the default), you can use the object directly to perform predictions, e.g. gridcv.predict(X_test).
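
As a short sketch of both tips, continuing from the loop in the question (gridcv, X_test, y_test and error_knn as defined there):

# Tip 1: the estimator refitted with the best hyperparameters found
best_knn = gridcv.best_estimator_
print(gridcv.best_params_)  # e.g. {'knn__n_neighbors': 7}

# Tip 2: with refit=True (the default), the GridSearchCV object itself
# predicts using that best estimator
error_knn.append(1.0 - np.mean(gridcv.predict(X_test) == y_test))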

EDIT: Perhaps I was too general about when to perform categorical encoding. Your approach should depend on your problem/dataset.

If you know beforehand which categories exist in your data and you want to train your inner CV classifiers with this knowledge, you should perform categorical encoding as the first step.

If at training time you do not know the full range of categories you are going to see, or you want to train your CV classifiers without knowledge of that full range, you should perform categorical encoding within each fold.

With the former, all your classifiers are trained on the same feature space; with the latter, that is not guaranteed.

If using the latter, the above pipeline can be extended to incorporate categorical encoding:

pipeline = Pipeline(
    [
        # handle_unknown='ignore' encodes categories unseen during fit as
        # all-zero columns instead of raising an error at transform time
        ('enc', OneHotEncoder(handle_unknown='ignore')),
        # with_mean=False because centering would densify the sparse
        # one-hot output
        ('scaler', StandardScaler(with_mean=False)),
        ('knn', KNeighborsClassifier()),
    ],
)
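
A usage sketch, with variable names as defined in the question's outer loop: the encoder is now re-fitted on each fold's raw categorical training split, and handle_unknown='ignore' covers categories that split never saw:

gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1, param_grid=param_grid,
                      scoring=make_scorer(accuracy_score))
gridcv.fit(X_train, y_train)  # X_train: raw (unencoded) categorical columns
error_knn.append(1.0 - np.mean(gridcv.predict(X_test) == y_test))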

I suggest you read the Encoding categorical features section of scikit-learn's User Guide carefully.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1