Why is my sklearn SVC much slower when scaling manually instead of using the pipeline?
I want to build an SVC on the IMDB dataset with a CountVectorizer. The docs recommend scaling the training/test data for better results. The example code there uses a pipeline, but it should work with manual scaling as well.
import numpy as np
import datasets
import pandas as pd
import timeit
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# prepare dataset
training_data = pd.DataFrame(datasets.load_dataset("imdb", split="train"))
training_data = training_data.sample(frac=1).reset_index(drop=True)
test_data = pd.DataFrame(datasets.load_dataset("imdb", split="test"))
test_data = test_data.sample(frac=1).reset_index(drop=True)
vect = CountVectorizer()
data = vect.fit_transform(pd.concat([training_data["text"], test_data["text"]])).toarray()
train = data[0:25000,]
test = data[25000:50000,]
# find most frequent words in the comments
count = np.sum(train, axis=0)
ind = np.argsort(count)[::-1]
ks = [10]
#ks = [10, 50, 100, 500, 1000, 2000] # I want to compare the results for different k
# reduce features to the k most frequent tokens
# columns are already sorted by frequency desc
k_ind = ind[:max(ks)]
X = np.ascontiguousarray(train[:,k_ind])
y = training_data["label"]
test_set = np.ascontiguousarray(test[:,k_ind])
print(f"Check if X is C-contiguous: {X[:,:min(ks)].flags}")
# Test the execution time with pipeline first
for k in ks:
    clf = make_pipeline(StandardScaler(), SVC(C=1, kernel='linear', cache_size=4000))
    # only use the k most frequent features
    t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
    print(f"Time with pipeline: {t}s")
# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
scaler.transform(X, copy=False)
scaler.fit(test_set)
scaler.transform(test_set, copy=False)
for k in ks:
    clf = SVC(C=1, kernel='linear', cache_size=4000)
    t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
    print(f"Time with manual scaling: {t}s")
This produces the output:
Check if X is C-contiguous:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
Time with pipeline: 58.852547400000276s
Time with manual scaling: 181.0459952000001s
As you can see, the pipeline is much faster. Why is that the case?
I also want to test the classifier for different values of k, but then the pipeline and the scaler would be called multiple times on the same training data. Will scaling already-scaled data return the same result, or will it change on every iteration? (That is why I scaled manually once and then slice the scaled data.)
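As a side note on the second question, here is a minimal sketch with synthetic data (not the IMDB matrices from the question): StandardScaler standardizes each column independently, so re-standardizing already-standardized data is effectively a no-op, and scaling commutes with column slicing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

X_scaled = StandardScaler().fit_transform(X)

# Each column of X_scaled already has mean 0 and std 1,
# so standardizing a second time changes (almost) nothing.
X_rescaled = StandardScaler().fit_transform(X_scaled)
print(np.allclose(X_scaled, X_rescaled))

# Because scaling is per-column, "slice then scale" equals
# "scale then slice" for the first k columns.
k = 2
a = StandardScaler().fit_transform(X[:, :k])
b = X_scaled[:, :k]
print(np.allclose(a, b))
```

So scaling once and slicing afterwards (as in the question) gives the same features as scaling each slice separately.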
Solution 1:[1]
Well, I simply forgot to save the scaled arrays...
<...>
# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X, copy=False)
scaler.fit(test_set)
test_set = scaler.transform(test_set, copy=False)
<...>
Now it takes the same time for both approaches.
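For completeness, a minimal sketch with synthetic data (not the asker's IMDB matrices): the usual convention is to fit the scaler on the training split only and reuse its statistics for the test split, rather than fitting a second scaler on the test set as in the original code, so that both splits end up on the same scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(loc=2.0, scale=4.0, size=(200, 5))
X_test = rng.normal(loc=2.0, scale=4.0, size=(50, 5))

# Fit on the training split only, then reuse the learned
# per-column mean/std to transform both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Training columns are exactly centered; test columns are only
# approximately centered, because they use the training statistics.
print(np.allclose(X_train_s.mean(axis=0), 0.0))
```

Note that `transform` always returns the result; `copy=False` merely permits in-place computation when dtype and layout allow it, so the return value must still be assigned, which is exactly the bug above.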
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Flamingi |
