Why is the scikit-learn SVM classifier running so long while using so little CPU?

I am running the scikit-learn SVM classifier (SVC) from Python 2.7.10, and it has been running for over two hours now. I read the data using pandas.read_csv, preprocessed it, and then ran:

clf = SVC(C = 0.001, kernel = 'linear', cache_size = 7000, verbose = True)
clf.fit(X_train, y_train)

I have experience running classifiers (random forests and deep neural networks) in H2O using R, and they never take this long! The machine I am running on has 16 GB of RAM and an i7 at 3.6 GHz per core. The Task Manager tells me that 8.6 GB of RAM are being used by Python, but only 13% of the CPU. I don't quite understand why it is so slow while not even using all available resources.

The data I have has 12,000,000 rows and 22 columns, and the only verbose output sklearn gives me is a single line:

[LibSVM]

Is that normal behavior, or should I see a lot more? Could anyone post the verbose output of an SVC run that finished? Also, can I do anything to speed things up besides lowering the C parameter? Using fewer rows is not really an option, since I want to benchmark algorithms and they wouldn't be comparable if different training data were used. Finally, can anyone explain why so little of my resources is being used?



Solution 1:[1]

You can try an accelerated implementation of the algorithms, such as scikit-learn-intelex - https://github.com/intel/scikit-learn-intelex

For SVM you would certainly get higher compute efficiency; however, for such a large dataset, training would still take a noticeable amount of time.

First, install the package:

pip install scikit-learn-intelex

Then add the following to your Python script, before importing the scikit-learn estimators:

from sklearnex import patch_sklearn
patch_sklearn()

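Putting it together, a minimal end-to-end sketch might look like the following. It uses a small synthetic dataset (via make_classification) as a stand-in for the 12-million-row data in the question, and falls back to stock scikit-learn if scikit-learn-intelex is not installed:

```python
# Patch scikit-learn with the accelerated implementations if available;
# otherwise stock scikit-learn is used unchanged.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # must run before importing estimators from sklearn
except ImportError:
    pass

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic stand-in for the real 12M x 22 dataset in the question
X_train, y_train = make_classification(
    n_samples=1000, n_features=22, random_state=0
)

clf = SVC(C=0.001, kernel='linear', cache_size=7000, verbose=True)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
```

With the patch active, the same SVC API is used; only the underlying implementation changes, so no other code needs modifying.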
Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
