sklearn Kernel PCA with different order of samples

I've encountered a problem when using the kernel PCA implemented in sklearn: the order of the samples before KPCA significantly influences the classification accuracy. Here is my processing procedure:

  1. Run kernel PCA on the input X of shape (n_samples, n_features).
  2. Shuffle the transformed X and split it into a training set and a test set (10-fold).
  3. Use an extra-trees classifier, SVC, or another classifier implemented in sklearn to perform the binary classification task.

My code:

import numpy as np
import scipy.io as sio
from sklearn.utils import shuffle
from sklearn.decomposition import KernelPCA
import sklearn.metrics as metrics
from sklearn.model_selection import KFold
from sklearn.ensemble import ExtraTreesClassifier as etclf

# load data
datapath=r"F:\..."
data=sio.loadmat(datapath+"\\...")
x=data["x"]
labels=data["labels"]

# kernel PCA on the full data set (gamma must be a scalar, not a list)
gm=1e-5
nfea=x.shape[1]
kpca=KernelPCA(n_components=nfea,kernel='rbf',gamma=gm,eigen_solver="auto",random_state=42)
x_pca=kpca.fit_transform(x)

# shuffle the x_pca together with the labels
x_shuffle,y_shuffle=shuffle(x_pca,labels,random_state=42)
data_label=np.concatenate((x_shuffle,y_shuffle),axis=1)

# 10-fold cross validation
accs=[]
kf = KFold(n_splits=10,shuffle=False)
for train,test in kf.split(data_label):
    x_train=data_label[train,:-1]
    x_test=data_label[test,:-1]
    y_train=data_label[train,-1]
    y_test=data_label[test,-1]

    # binary classification prediction
    clf=etclf(n_estimators=10,criterion='gini',random_state=42)
    clf.fit(x_train,y_train)
    y_pred=clf.predict(x_test)
    accs.append(metrics.accuracy_score(y_test,y_pred))

print("mean accuracy:",np.mean(accs))

Before applying kernel PCA, I tried two different orderings of x:

  1. When I sorted x by label from 1 to 0 (i.e., 1111111...111000000...000), I got an accuracy close to 0.99.
  2. When I shuffled x together with its labels (i.e., 1100011100...00101100100011), I got an accuracy of about 0.50.

I also tried other classifiers such as SVC and Gaussian naive Bayes, and the results were similar, so I don't think it is a matter of the classifier or of leakage between the training set and the test set. It seems more likely that KPCA creates high correlations between samples that are close together in order, but I don't know how to explain this result.
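
For what it's worth, below is a minimal sketch of the sanity check I would expect to pass (synthetic data; the array shapes and gamma=0.1 are arbitrary illustrative choices, not my actual settings). Mathematically, permuting the input rows should only permute the rows of the KPCA embedding, up to per-component sign flips, since reordering the samples just reorders the kernel matrix and its eigenvectors.

import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.RandomState(0)
X = rng.randn(100, 5)           # synthetic data, arbitrary shape
perm = rng.permutation(len(X))  # a random reordering of the samples

# fit KPCA separately on the original and on the permuted data
Z1 = KernelPCA(n_components=3, kernel='rbf', gamma=0.1).fit_transform(X)
Z2 = KernelPCA(n_components=3, kernel='rbf', gamma=0.1).fit_transform(X[perm])

# undo the permutation and compare up to per-component sign flips;
# note this comparison can fail numerically when eigenvalues are
# (nearly) degenerate, e.g. for a very small gamma
print(np.allclose(np.abs(Z1[perm]), np.abs(Z2), atol=1e-8))

If this prints True for a given gamma, the embedding itself does not depend on sample order, and any accuracy difference would have to arise later in the pipeline.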

Thanks for any help!


