Cosine similarity and SVC using scikit-learn
I am trying to use a cosine similarity kernel for text classification with an SVM, on a raw dataset of 1000 words:
# Libraries
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
# Data
x_train, x_test, y_train, y_test = train_test_split(raw_data[:, 0], raw_data[:, 1], test_size=0.33, random_state=42)
# CountVectorizer
c = CountVectorizer(max_features=1000, analyzer = "char")
X_train = c.fit_transform(x_train).toarray()
X_test = c.transform(x_test).toarray()
# Kernel
cosine_X_tr = cosine_similarity(X_train)
cosine_X_tst = cosine_similarity(X_test)
# SVM
svm_model = SVC(kernel="precomputed")
svm_model.fit(cosine_X_tr, y_train)
y_pred = svm_model.predict(cosine_X_tst)
But that code throws the following error:
ValueError: X has 330 features, but SVC is expecting 670 features as input
I've tried the following, but I don't know whether it is mathematically correct. I also want to implement other custom kernels that are not available in scikit-learn, such as histogram intersection:
cosine_X_tst = cosine_similarity(X_test, X_train)
So, basically, the main problem lies in the dimensions of the matrix SVC receives. Once CountVectorizer is applied to the train and test datasets, both have 1000 features because of the max_features parameter:
- Train dataset of shape (670, 1000)
- Test dataset of shape (330, 1000)
But after applying cosine similarity, they are converted to square matrices:
- Train dataset of shape (670, 670)
- Test dataset of shape (330, 330)
When SVC is fitted to the train data it learns 670 features, so it cannot predict on the test dataset, which has a different number of features (330). So, how can I solve this problem and use custom kernels with SVC?
Solution 1:[1]
So, how can I solve that problem and be able to use custom kernels with SVC?
Define a function yourself and pass it to the kernel parameter of SVC(), like: SVC(kernel=your_custom_function). See this.
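As a rough illustration of a custom kernel function, here is a minimal sketch of the histogram intersection kernel mentioned in the question. The function name histogram_intersection and the random toy data are assumptions made for this example, not part of the original code. A callable passed to SVC receives two arrays and must return a matrix of shape (n_samples_X, n_samples_Y).

```python
import numpy as np
from sklearn.svm import SVC


def histogram_intersection(X, Y):
    """Hypothetical helper: K[i, j] = sum_k min(X[i, k], Y[j, k])."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Broadcast to (n_X, n_Y, n_features), take the elementwise minimum,
    # then sum over the feature axis. This is memory-hungry for large
    # inputs, so a chunked loop may be preferable on real data.
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)


# Made-up count data standing in for the CountVectorizer output
rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(40, 10))
y_train = np.array([0, 1] * 20)
X_test = rng.integers(0, 5, size=(15, 10))

# SVC calls the function itself, on (train, train) when fitting
# and on (test, train) when predicting
svm_model = SVC(kernel=histogram_intersection)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
print(y_pred.shape)  # (15,)
```

With this approach the feature matrices are passed to fit and predict directly, so the shape mismatch from the question should not come up.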
Also, you should use the cosine_similarity kernel like below in your code:
svm_model = SVC(kernel=cosine_similarity)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
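On the question of whether cosine_similarity(X_test, X_train) is correct: with kernel="precomputed", predict expects a matrix of shape (n_test_samples, n_train_samples), which is exactly what that call returns. The sketch below compares the two routes on a tiny made-up corpus (the texts and labels are assumptions for illustration, standing in for raw_data) and checks that they give the same predictions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus standing in for raw_data
train_texts = ["aaa bbb", "abab ab", "bbaa ab", "xyz xyz", "zzyx yx", "xxyz zy"]
train_labels = [0, 0, 0, 1, 1, 1]
test_texts = ["abba", "zyx"]

c = CountVectorizer(analyzer="char")
X_train = c.fit_transform(train_texts).toarray()
X_test = c.transform(test_texts).toarray()

# Option A: let SVC call the kernel function itself
svm_callable = SVC(kernel=cosine_similarity).fit(X_train, train_labels)
pred_callable = svm_callable.predict(X_test)

# Option B: precompute the Gram matrices by hand
K_train = cosine_similarity(X_train)          # train vs. train, square
K_test = cosine_similarity(X_test, X_train)   # test vs. train, rectangular
svm_pre = SVC(kernel="precomputed").fit(K_train, train_labels)
pred_pre = svm_pre.predict(K_test)

print(K_train.shape, K_test.shape)
print(np.array_equal(pred_callable, pred_pre))
```

The same pattern should carry over to any other kernel function: compute it between train and train for fit, and between test and train for predict.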
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | akilat90 |