'Cosine similarity and SVC using scikit-learn

I am trying to utilize the cosine similarity kernel to text classification with SVM with a raw dataset of 1000 words:

# Libraries
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Data
x_train, x_test, y_train, y_test = train_test_split(raw_data[:, 0], raw_data[:, 1], test_size=0.33, random_state=42)

# CountVectorizer
c = CountVectorizer(max_features=1000, analyzer = "char")
X_train = c.fit_transform(x_train).toarray()
X_test = c.transform(x_test).toarray()

# Kernel
cosine_X_tr = cosine_similarity(X_train)
cosine_X_tst = cosine_similarity(X_test)

# SVM
svm_model = SVC(kernel="precomputed")
svm_model.fit(cosine_X_tr, y_train)
y_pred = svm_model.predict(cosine_X_tst)

But that code throws the following error:

ValueError: X has 330 features, but SVC is expecting 670 features as input

I've tried the following, but I don't know it is mathematically accurate and because also I want to implement other custom kernels not implemented within scikit-learn like histogram intersection:

cosine_X_tst = cosine_similarity(X_test, X_train)

So, basically the main problem resides in the dimensions of the matrix SVC recieves. Once CountVectorizer is applied to train and test datasets those have 1000 features because of max_features parameter:

Train dataset of shape (670, 1000)
Test dataset of shape (330, 1000)

But after applying cosine similarity are converted to squared matrices:

Train dataset of shape (670, 670)
Test dataset of shape (330, 330)

When SVC is fitted to train data it learns 670 features and will not be able to predict test dataset because has a different number of features (330). So, how can i solve that problem and be able to use custom kernels with SVC?

Solution 1:^[1]

So, how can i solve that problem and be able to use custom kernels with SVC?

Define a function yourself, and pass that function to the kernel parmeter in SVC(), like: SVC(kernel=your_custom_function). See this.

Also, you should use the cosine_similarity kernel like below in your code:

svm_model = SVC(kernel=cosine_similarity)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	akilat90

'Cosine similarity and SVC using scikit-learn

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]