Why is my sklearn SVC much slower when manually scaling instead of using a pipeline?

I want to build an SVC on the IMDB dataset using a CountVectorizer. The docs recommend scaling the training/test data for better results. The example code uses a pipeline, but it should work with manual scaling as well.

import numpy as np
import datasets
import pandas as pd
import timeit
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# prepare dataset
training_data = pd.DataFrame(datasets.load_dataset("imdb", split="train"))
training_data = training_data.sample(frac=1).reset_index(drop=True)
test_data = pd.DataFrame(datasets.load_dataset("imdb", split="test"))
test_data = test_data.sample(frac=1).reset_index(drop=True)

vect = CountVectorizer()
data = vect.fit_transform(pd.concat([training_data["text"], test_data["text"]])).toarray()
train = data[0:25000,]
test = data[25000:50000,]

# find most frequent words in the comments
count = np.sum(train, axis=0)
ind = np.argsort(count)[::-1]

ks = [10]
#ks = [10, 50, 100, 500, 1000, 2000] # I want to compare the results for different k

# reduce features to the k most frequent tokens
# columns are already sorted by frequency desc
k_ind = ind[:max(ks)]
X = np.ascontiguousarray(train[:,k_ind])
y = training_data["label"]
test_set = np.ascontiguousarray(test[:,k_ind])

print(f"Check if X is C-contiguous: {X[:,:min(ks)].flags}")

# Test the execution time with pipeline first
for k in ks:
  clf = make_pipeline(StandardScaler(), SVC(C=1, kernel='linear', cache_size=4000))
  # only use k features
  t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
  print(f"Time with pipeline: {t}s")

# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
scaler.transform(X, copy=False)
scaler.fit(test_set)
scaler.transform(test_set, copy=False)

for k in ks:
  clf = SVC(C=1, kernel='linear', cache_size=4000)
  t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
  print(f"Time with manual scaling: {t}s")

This produces the output:

Check if X is C-contiguous:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
Time with pipeline: 58.852547400000276s
Time with manual scaling: 181.0459952000001s

As you can see, the pipeline is much faster. Why is that the case? I want to test the classifier for different k, but then the pipeline and the scaler will be called multiple times on the same training data. Will scaling already-scaled data return the same result, or will it change on each iteration? (That is why I scaled manually first and then sliced the scaled data.)



Solution 1:[1]

Well, I simply forgot to save the arrays returned by transform, so the SVC was still being fitted on the unscaled data...

<...>
# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X, copy=False)
scaler.fit(test_set)
test_set = scaler.transform(test_set, copy=False)
<...>

Now it takes the same time for both approaches.
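As an aside on why copy=False did not scale the arrays in place (this reasoning is not from the original answer): .toarray() on the CountVectorizer output produces an integer array, and StandardScaler has to convert it to float before transforming, which forces a copy regardless of copy=False. The original arrays were therefore never modified, and the SVC trained on unscaled counts. A minimal sketch illustrating this, which also shows that re-standardizing already standardized data is numerically a no-op (so slicing the scaled array per k is safe):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# An integer array, like the counts from CountVectorizer(...).toarray()
X = np.arange(6, dtype=np.int64).reshape(3, 2)

scaler = StandardScaler().fit(X)
out = scaler.transform(X, copy=False)

# The int array cannot hold floats, so transform copies despite copy=False:
print(X[0, 0])     # original count, unchanged
print(out.dtype)   # float64 -- a new array was returned

# Re-standardizing already standardized data changes (almost) nothing,
# because its per-column mean is ~0 and std is ~1:
out2 = StandardScaler().fit_transform(out)
print(np.allclose(out, out2))
```

So as long as the return value is reassigned (as in the corrected code above), scaling once and then slicing columns for each k gives consistent results.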

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Flamingi